
Implement a query engine for incremental compilation #355

Merged
merged 5 commits into from
Nov 11, 2024
Conversation


@mcy mcy commented Oct 14, 2024

This change adds a new experimental/incremental package, which provides a generic dependency memoization framework, intended for parallelization of compilation operations. It follows the current design for parallelization in the compiler but has the following new features:

  1. Dynamic scheduling of dependency queries.
  2. Persistence of intermediate results within and across queries.
  3. Partial invalidation of results (e.g. for when files change).

This is intended to be used for parallelization and memoization in the LSP and throughout the compiler.


mcy commented Oct 18, 2024

One thing to consider: incremental.Executor leaks memory (in the situation where the executor is long-lived, which is expected). I am contemplating a tradeoff:

  1. Make Invalidate() also evict entries. It currently does not do this because write operations to sync.Map are slow.
    a. Incidentally, we can replace sync.Map with a hand-rolled sharded map instead (because sync.Map kinda sucks lol).
  2. Add something like Evict() to explicitly evict a URL and its dependencies (or an evict bool to Invalidate()).
  3. Do some kind of funny LRU thing to evict old entries?

Comment on lines 49 to 67
// Queries returns a snapshot of the URLs of which queries are present (and
// memoized) in an Executor.
//
// The returned slice is sorted.
Member:

Is this just for testing? Seems like this could potentially be a big slice to copy. Maybe instead of returning a slice, it should return an iter.Seq[string], and you could push making a sorted slice to the caller.

Member Author:

It's kind of intended as a general debugging aid. I'm not sure we really need an iterator for this, because you really do want to sort them first and that requires buffering regardless.

//
// If any of the queries fails, or if the [context.Context] passed to the
// [Run] call that spawned the [Task] is cancelled, this function calls
// [runtime.Goexit]. This is not something callers should be concerned about,
Member:

Wha?? If any single query fails, the whole goroutine disappears? That seems like a very dangerous API. This would be simpler to use if it just used standard error-return control flow. If a query fails, it seems like it should return a corresponding failed Result. And if the whole task is cancelled, it should return the cancellation cause.

Member:

Also, my gut reaction is "why is this even exported"? I now see it's because it really wants to be a method of Task (like Run wants to be a method of Executor), but it needs to be generic for type safety. Maybe mention something to that effect in the Go docs for each of these, to make their use more clear. In this case, this is expected to be called from Query.Execute to gather dependencies.

Member Author:

Yeah, you know, I thought about it and this is kinda nuts. The new API instead has a really scary doc note along the lines of "hey, you need to return this error or something will explode".

I also added the docs you suggested.

return nil, context.Cause(ctx)
}

// Now, for each non-failed result, we need to walk their dependencies and
Member:

What does "non-failed" mean in this regard? The loop has no conditional that looks like it's filtering out failed items.

Also, why would we exclude the errors from failed results?

Member Author:

This no longer makes sense with the new error propagation convention.

// Now, for each non-failed result, we need to walk their dependencies and
// collect their errors.
for i, query := range queries {
task := e.getTask(query.URL())
Member:

If we were to introduce a limit on the size of the cached results, this could potentially create a new task instead since the one created above might have been evicted. Maybe we instead need to associate these tasks as deps of root, so you can instead iterate root's deps and not worry about concurrent/near-immediate eviction.

Member:

Or at the least, pass a flag to getTask to tell it to not create a task and instead return nil. Then update this to ignore nil return values?

Member Author:

Can we punt automatic eviction to a followup?

Member:

Sure, but it would be nice to first drop a TODO right here so we don't forget that there is a potential risk here.

experimental/incremental/executor.go (resolved thread)
// queries which depend on it.
//
// This will not cause the query to fail; see [Task.Fail] for that.
func (t *Task) Error(errs ...error) {
Member:

I see this is borrowing heavily from testing.T, but I think we could make it more clear, since this is not test code. Maybe instead this signature could be NonFatalError(...error). Also, it's a little confusing that it allows zero errors to be provided, so maybe even NonFatalError(error, ...error).

Member Author:

It's NonFatal now.


for i, q := range queries {
i := i
deps[i] = caller.exec.getTask(q.URL())
Member:

One thing I had to add to the main executor in the root protocompile package (since the existing compiler is somewhat similar in how it compiles files) is cycle detection. If, for example, there were incorrect cyclic imports between files, we need to be able to detect that in order to prevent deadlock in this execution engine (where a task is indirectly blocked on itself).

The current compiler does not do this very efficiently (it recursively crawls the entire dependency graph of the caller task to see if it finds the newly requested file). I think a better approach would be to add to each task a "path" of its ancestor tasks, and if we ever see a request to resolve a dependency that already exists in the caller's path (or is equal to the caller), the task should fail fast.

Member:

We should also clearly document that Query.Execute should never call Run, only Resolve. (Otherwise, we'd lose the trail of callers and not be able to detect cycles.)

Member Author:

Implemented cycle tracking. There's even a test!

In the process I found a deadlock due to resource starvation. There's some fussiness in Resolve to drop the hold on exec.sema in the right place to avoid starvation.


go func() {
results = Resolve(root, queries...)
close(root.result.done)
Member:

I think this needs to be in a defer. Otherwise, if a query fails, the goroutine will exit, but since the context is not yet cancelled and this channel not closed, the select below would hang.

(Having said that, I've left other remarks about how I think the use of runtime.Goexit is dangerous and we should probably revisit how failures are communicated. But this probably needs to be a deferred function no matter how those changes play out.)

Member Author:

You're right regardless, if a panic comes crashing through for whatever reason, it will get swallowed.

if r.Value != nil {
// This type assertion will always succeed, unless the user has
// distinct queries with the same URL, which is a sufficiently
// unrecoverable condition that a panic is acceptable.
Member:

While internal panics are okay, they would be categorically bad if they happened in the BSR. Since this is running in a separate goroutine, the calling application (such as BSR code) would have no way of recovering.

The current compiler has a top-level deferred function (in the equivalent of this package's Run function, in the go function that launches these goroutines) that catches any panic and reports it as a fatal error, specifically for scenarios like this. That way compiler bugs don't turn into serious operational issues.

Member:

Though I also think it would be better to just return an error here. (Sorry to keep harping on that. But it's much more straight-forward control flow.)

Member Author:

I would generally consider this type of panic to be on the same level as a random nil deref. Avoiding panics is not as simple as "never panic"; in this case, it indicates a fairly catastrophic bug on the part of query writers.

How to handle such errors should generally be done by the people who are panic-intolerant, such as with a recover. We can't prevent query authors from panicking out of Execute, after all (unless we try catching that, too).

Member:

How to handle such errors should generally be done by the people who are panic-intolerant, such as with a recover. We can't prevent query authors from panicking out of Execute, after all (unless we try catching that, too).

The problem with this is that they can only recover from panics on their (calling) goroutine. If the library function they call spawns a goroutine and the panic happens there (like with this), then there is no ability to catch it and prevent it from crashing the app.

So while I understand that we can't eliminate all panics, since nil-deref/index-out-of-range/etc could still occur, we can and should recover from any panic that happens in a goroutine that is created as an internal detail of this package. The existing compiler in the protocompile package does that here.

func Resolve[T any](caller Task, queries ...Query[T]) []Result[T] {
results := make([]Result[T], len(queries))
deps := make([]*task, len(queries))
wg := semaphore.NewWeighted(int64(len(queries)))
Member:

I see you call it wg, like "wait group". I guess you are using a semaphore because it is context-aware? Maybe a little comment that we use this instead of sync.WaitGroup specifically because waiting can be interrupted via context cancellation?

Member Author:

I actually added such a comment before I even noticed your comment here!


jhump commented Oct 23, 2024

we can replace sync.Map with a hand-rolled sharded map instead (because sync.Map kinda sucks lol).

Yeah, I am not a fan of sync.Map at all. Its interface is okay, but its internal behavior feels a bit unpredictable and some things are slow (even though they have "amortized" reasonable complexity). I expect a map guarded by a simple RWMutex will be adequate for our use, and we could move to a sharded data structure if we find there's too much contention on the lock. But I have a feeling that keeping things simple will be fine and sharding now may be premature optimization.

FWIW, there are other open-source cache implementations that are decent and handle the size limits with evicting LRU entries.

mcy added a commit to bufbuild/buf that referenced this pull request Oct 23, 2024
…3403)

Since removing concurrency from the LSP, performance has degraded substantially. This PR fixes some performance issues by being smart about doing less work. Most of this will be mooted by bufbuild/protocompile#355, but until we have intelligent memoization, we can use some dumb heuristics to improve perf.

First, we don't send progress notifications for files that have not been opened by the client's editor. Hammering the Unix socket with notifications is a major source of slowdown, and these notifications are not useful to the user, because they are about files they do not care about.

Second, we don't send diagnostics for the same. These files get reparsed when opened in the editor regardless, so this doesn't risk staleness.

Third, we were previously reindexing imported files once per cross-file ref. This is clearly an oversight on my part, which I suspect was caused due to nasty merge conflicts on my last PR. I noticed because protovalidate/priv/private.proto was getting hammered in the logs -- all because validate.proto references the priv symbol dozens of times. 🤦

After fixing all of these, the LSP went from sluggish to snappy (in VSCode).
@mcy mcy requested a review from jhump October 24, 2024 20:49
//
// Errors that occur during each query are contained within the returned results.
// Unlike [Resolve], these contain the *transitive* errors for each query!
func Run[T any](ctx context.Context, e *Executor, queries ...Query[T]) (results []Result[T], cancelCause error) {
//
// Implementations of [Query].Execute MUST NOT UNDER ANY CIRCUMSTANCES call
Member:

Maybe a TODO that we could do a best-effort validation of this by providing a context to Query.Execute. The package could add a context value to it to indicate we're executing a query, and this function could check for that context value and, if present, fail fast.

Member Author:

I went ahead and did the validation :)

})
results[i].NonFatal = r.NonFatal
results[i].Fatal = r.Fatal
}) || async // Need to avoid short-circuiting here!
Member:

Could instead use |= up on line 98.

Member Author:

Go does not provide |= for bools. I know, right?!

Member:

Ah, right, I've run into this before, too. 🤮

url := q.URL()
fmt.Println(url)
for node.Query != nil {
fmt.Println(node.Query.URL(), url)
Member:

Oops, I suspect this was in here (and a couple of lines above) for debugging and accidentally left in?

Member Author:

oops

experimental/incremental/query.go (outdated, resolved thread)
}

if wg.Acquire(caller.ctx, 1) != nil {
return
return false
}

// Complete the rest of the computation asynchronously.
go func() {
defer wg.Release(1)
Member:

FWIW, this is where I think we need to put a recover call. That way, panics in user-provided code (i.e. Query.Execute) but also bugs in compiler/library code won't cause the calling application to crash.

Member Author:

ErrPanic!

// Note: this function really wants to be a method of [Task], but it isn't
// because it's generic.
func Resolve[T any](caller Task, queries ...Query[T]) (results []Result[T], expired error) {
caller.checkDone()
Member:

Since Resolve now returns an error, I think it would be better to have this function return an error that can be propagated, instead of panic.

Member Author:

I somewhat disagree that this should be an error, but I don't think it deeply matters.

Member Author:

Actually no it has to be a panic, otherwise e.g. callers need to check for an error in NonFatal too. I think a panic is better, because it should only ever happen if you do something nasty like escape a Task into a global.

Comment on lines 206 to 209
node := &caller.path
url := q.URL()
fmt.Println(url)
for node.Query != nil {
Member:

Could use a multi-statement for loop to make this a little more concise:

for node := &caller.path; node.Query != nil; node = node.Prev {

Member Author:

Better yet: path.Walk.

Comment on lines 271 to 272
if closed(r.done) {
done()
Member:

I think this would be a little more intuitively expressed with an unconditional close(r.done). I think the only places it gets closed are here and then after q.Execute is called.

Then, to detect panics, a construct like this is effective and a little less code/easier to follow IMO:

didPanic := true
defer func() { /* ... */ }()
r.Value, r.Fatal = q.Execute(callee)
didPanic = false

So the deferred function just inspects didPanic.

Member Author:

I simplified this significantly another way, please take a look at run and runAsync

@@ -204,21 +266,34 @@ func run[T any](caller Task, task *task, q Query[T], wg *semaphore.Weighted, don
}

defer func() {
callee.done = true
caller.exec.sema.Release(1)
Member:

Is it safe to do this before the mutation to the task below (line 277, task.result.Store(nil))?

I think after this is called, the calling task may immediately "wake up" and could potentially observe a partial result before the code below clears it.

Member Author:

not only is callee.done incorrect, I got rid of Task.done altogether.

Comment on lines 22 to 25
// URLBuilder is a helper for building URLs.
//
// It is a simplified version of the interface of [net/url.URL].
type URLBuilder struct {
Member:

This doesn't seem like a real "builder" pattern. Instead, this seems like a URI, not a builder, and the Build() method could just as reasonably be String().

Also, not sure this pays its freight -- it's extra surface area to maintain if/when we make it exported/non-experimental, and it doesn't seem hugely valuable.

Member Author:

Deleted it :)

@mcy mcy requested a review from jhump October 30, 2024 17:30
@jhump jhump left a comment (Member):

I left a few comments, but I think this is really close. So LGTM modulo the few remarks.

return anyQuery{
url: q.URL(),
return &AnyQuery{
key: q.Key(),
Member:

This needs to also set actual: q.

Member Author:

oopies

experimental/incremental/task.go (three outdated, resolved threads)

if wg.Acquire(caller.ctx, 1) != nil {
// run executes a query in the context of some task and writes the result to out.
func (t *task) run(caller Task, q *AnyQuery, done func(*result)) (async bool) {
Member:

I think this and the one below are confusingly named. runAny actually blocks for the result and isn't really async. This is the async version, since it returns immediately and can spawn a goroutine to actually invoke done.

Maybe rename this one to start and the other to run?

@mcy mcy enabled auto-merge (squash) November 11, 2024 17:56
@mcy mcy disabled auto-merge November 11, 2024 17:56
@mcy mcy enabled auto-merge (squash) November 11, 2024 17:57
@mcy mcy merged commit aa4cf26 into main Nov 11, 2024
8 checks passed
@mcy mcy deleted the mcy/query branch November 11, 2024 18:01