
Implement a query engine for incremental compilation #355

Merged
merged 5 commits into from
Nov 11, 2024
Conversation


@mcy mcy commented Oct 14, 2024

This change adds a new experimental/incremental package, which provides a generic dependency memoization framework, intended for parallelization of compilation operations. It follows the current design for parallelization in the compiler but has the following new features:

  1. Dynamic scheduling of dependency queries.
  2. Persistence of intermediate results within and across queries.
  3. Partial invalidation of results (e.g. for when files change).

This is intended to be used for parallelization and memoization in the LSP and throughout the compiler.


mcy commented Oct 18, 2024

One thing to consider: incremental.Executor leaks memory (in the situation where the executor is long-lived, which is expected). I am contemplating a tradeoff:

  1. Make Invalidate() also evict entries. It currently does not do this because write operations to sync.Map are slow.
    a. Incidentally, we can replace sync.Map with a hand-rolled sharded map instead (because sync.Map kinda sucks lol).
  2. Add something like Evict() to explicitly evict a URL and its dependencies (or an evict bool to Invalidate()).
  3. Do some kind of funny LRU thing to evict old entries?

Comment on lines 49 to 67
// Queries returns a snapshot of the URLs of which queries are present (and
// memoized) in an Executor.
//
// The returned slice is sorted.
Member:

Is this just for testing? Seems like this could potentially be a big slice to copy. Maybe instead of returning a slice, it should return an iter.Seq[string], and you could push making a sorted slice to the caller.

Member Author:

It's kind of intended as a general debugging aid. I'm not sure we really need an iterator for this, because you really do want to sort them first and that requires buffering regardless.

//
// If any of the queries fails, or if the [context.Context] passed to the
// [Run] call that spawned the [Task] is cancelled, this function calls
// [runtime.Goexit]. This is not something callers should be concerned about,
Member:

Wha?? If any single query fails, the whole goroutine disappears? That seems like a very dangerous API. This would be simpler to use if it just used standard error-return control flow. If a query fails, it seems like it should return a corresponding failed Result. And if the whole task is cancelled, it should return the cancellation cause.

Member:

Also, my gut reaction is "why is this even exported"? I now see it's because it really wants to be a method of Task (like Run wants to be a method of Executor), but it needs to be generic for type safety. Maybe mention something to that effect in the Go docs for each of these, to make their use more clear. In this case, this is expected to be called from Query.Execute to gather dependencies.

Member Author:

Yeah, you know, I thought about it and this is kinda nuts. The new API instead has a really scary doc note along the lines of "hey, you need to return this error or something will explode".

I also added the docs you suggested.

return nil, context.Cause(ctx)
}

// Now, for each non-failed result, we need to walk their dependencies and
Member:

What does "non-failed" mean in this regard? The loop has no conditional that looks like it's filtering out failed items.

Also, why would we exclude the errors from failed results?

Member Author:

This no longer makes sense with the new error propagation convention.

// Now, for each non-failed result, we need to walk their dependencies and
// collect their errors.
for i, query := range queries {
task := e.getTask(query.URL())
Member:

If we were to introduce a limit on the size of the cached results, this could potentially create a new task instead since the one created above might have been evicted. Maybe we instead need to associate these tasks as deps of root, so you can instead iterate root's deps and not worry about concurrent/near-immediate eviction.

Member:

Or at the least, pass a flag to getTask to tell it to not create a task and instead return nil. Then update this to ignore nil return values?

Member Author:

Can we punt automatic eviction to a followup?

Member:

Sure, but it would be nice to first drop a TODO right here so we don't forget that there is a potential risk here.

experimental/incremental/executor.go (resolved thread)
// queries which depend on it.
//
// This will not cause the query to fail; see [Task.Fail] for that.
func (t *Task) Error(errs ...error) {
Member:

I see this is borrowing heavily from testing.T, but I think we could make it more clear, since this is not test code. Maybe instead this signature could be NonFatalError(...error). Also, it's a little confusing that it allows zero errors to be provided, so maybe even NonFatalError(error, ...error).

Member Author:

It's NonFatal now.


for i, q := range queries {
i := i
deps[i] = caller.exec.getTask(q.URL())
Member:

One thing I had to add to the main executor in the root protocompile package (since the existing compiler is somewhat similar in how it compiles files) is cycle detection. If, for example, there were incorrect cyclic imports between files, we need to be able to detect that in order to prevent deadlock in this execution engine (where a task is indirectly blocked on itself).

The current compiler does not do this very efficiently (it recursively crawls the entire dependency graph of the caller task to see if it finds the newly requested file). I think a better approach would be to add to each task a "path" of its ancestor tasks, and if we ever see a request to resolve a dependency that already exists in the caller's path (or is equal to the caller), the task should fail fast.

Member:

We should also clearly document that Query.Execute should never call Run, only Resolve. (Otherwise, we'd lose the trail of callers and not be able to detect cycles.)

Member Author:

Implemented cycle tracking. There's even a test!

In the process I found a deadlock due to resource starvation. There's some fussiness in Resolve to drop the hold on exec.sema in the right place to avoid starvation.


go func() {
results = Resolve(root, queries...)
close(root.result.done)
Member:

I think this needs to be in a defer. Otherwise, if a query fails, the goroutine will exit, but since the context is not yet cancelled and this channel not closed, the select below would hang.

(Having said that, I've left other remarks about how I think the use of runtime.Goexit is dangerous and we should probably revisit how failures are communicated. But this probably needs to be a deferred function no matter how those changes play out.)

Member Author:

You're right regardless, if a panic comes crashing through for whatever reason, it will get swallowed.

if r.Value != nil {
// This type assertion will always succeed, unless the user has
// distinct queries with the same URL, which is a sufficiently
// unrecoverable condition that a panic is acceptable.
Member:

While internal panics are okay, they would be categorically bad if they happened in the BSR. Since this is running in a separate goroutine, the calling application (such as BSR code) would have no way of recovering.

The current compiler has a top-level deferred function (in the equivalent of this package's Run function, in the go function that launches these goroutines) that catches any panic and reports it as a fatal error, specifically for scenarios like this. That way compiler bugs don't turn into serious operational issues.

Member:

Though I also think it would be better to just return an error here. (Sorry to keep harping on that. But it's much more straight-forward control flow.)

Member Author:

I would generally consider this type of panic to be on the same level as a random nil deref. Avoiding panics is not as simple as "never panic"; in this case, it indicates a fairly catastrophic bug on the part of query writers.

How to handle such errors should generally be done by the people who are panic-intolerant, such as with a recover. We can't prevent query authors from panicking out of Execute, after all (unless we try catching that, too).

Member:

How to handle such errors should generally be done by the people who are panic-intolerant, such as with a recover. We can't prevent query authors from panicking out of Execute, after all (unless we try catching that, too).

The problem with this is that they can only recover from panics on their (calling) goroutine. If the library function they call spawns a goroutine and the panic happens there (like with this), then there is no ability to catch it and prevent it from crashing the app.

So while I understand that we can't eliminate all panics, since nil-deref/index-out-of-range/etc could still occur, we can and should recover from any panic that happens in a goroutine that is created as an internal detail of this package. The existing compiler in the protocompile package does that here.

func Resolve[T any](caller Task, queries ...Query[T]) []Result[T] {
results := make([]Result[T], len(queries))
deps := make([]*task, len(queries))
wg := semaphore.NewWeighted(int64(len(queries)))
Member:

I see you call it wg, like "wait group". I guess you are using a semaphore because it is context-aware? Maybe a little comment that we use this instead of sync.WaitGroup specifically because waiting can be interrupted via context cancellation?

Member Author:

I actually added such a comment before I even noticed your comment here!


jhump commented Oct 23, 2024

we can replace sync.Map with a hand-rolled sharded map instead (because sync.Map kinda sucks lol).

Yeah, I am not a fan of sync.Map at all. Its interface is okay, but its internal behavior feels a bit unpredictable and some things are slow (even though they have "amortized" reasonable complexity). I expect a map guarded by a simple RWMutex will be adequate for our use, and we could move to a sharded data structure if we find there's too much contention on the lock. But I have a feeling that keeping things simple will be fine and sharding now may be premature optimization.

FWIW, there are other open-source cache implementations that are decent and handle the size limits with evicting LRU entries.

mcy added a commit to bufbuild/buf that referenced this pull request Oct 23, 2024
…3403)

Since removing concurrency from the LSP, performance has degraded substantially. This PR fixes some performance issues by being smart about doing less work. Most of this will be mooted by bufbuild/protocompile#355, but until we have intelligent memoization, we can use some dumb heuristics to improve perf.

First, we don't send progress notifications for files that have not been opened by the client's editor. Hammering the Unix socket with notifications is a major source of slowdown, and these notifications are not useful to the user, because they are about files they do not care about.

Second, we don't send diagnostics for the same. These files get reparsed when opened in the editor regardless, so this doesn't risk staleness.

Third, we were previously reindexing imported files once per cross-file ref. This is clearly an oversight on my part, which I suspect was caused due to nasty merge conflicts on my last PR. I noticed because protovalidate/priv/private.proto was getting hammered in the logs -- all because validate.proto references the priv symbol dozens of times. 🤦

After fixing all of these, the LSP went from sluggish to snappy (in VSCode).
@mcy mcy requested a review from jhump October 24, 2024 20:49
//
// Errors that occur during each query are contained within the returned results.
// Unlike [Resolve], these contain the *transitive* errors for each query!
func Run[T any](ctx context.Context, e *Executor, queries ...Query[T]) (results []Result[T], cancelCause error) {
//
// Implementations of [Query].Execute MUST NOT UNDER ANY CIRCUMSTANCES call
Member:

Maybe a TODO that we could do a best-effort validation of this by providing a context to Query.Execute. The package could add a context value to it to indicate we're executing a query, and this function could check for that context value and, if present, fail fast.

Member Author:

I went ahead and did the validation :)

})
results[i].NonFatal = r.NonFatal
results[i].Fatal = r.Fatal
}) || async // Need to avoid short-circuiting here!
Member:

Could instead use |= up on line 98.

Member Author:

Go does not provide |= for bools. I know, right?!

Member:

Ah, right, I've run into this before, too. 🤮

url := q.URL()
fmt.Println(url)
for node.Query != nil {
fmt.Println(node.Query.URL(), url)
Member:

Oops, I suspect this was in here (and a couple of lines above) for debugging and accidentally left in?

Member Author:

oops

experimental/incremental/query.go (outdated, resolved thread)
}

if wg.Acquire(caller.ctx, 1) != nil {
return
return false
}

// Complete the rest of the computation asynchronously.
go func() {
defer wg.Release(1)
Member:

FWIW, this is where I think we need to put a recover call. That way, panics in user-provided code (i.e. Query.Execute) but also bugs in compiler/library code won't cause the calling application to crash.

Member Author:

ErrPanic!

// Note: this function really wants to be a method of [Task], but it isn't
// because it's generic.
func Resolve[T any](caller Task, queries ...Query[T]) (results []Result[T], expired error) {
caller.checkDone()
Member:

Since Resolve now returns an error, I think it would be better to have this function return an error that can be propagated, instead of panic.

Member Author:

I somewhat disagree that this should be an error, but I don't think it deeply matters.

Member Author:

Actually no it has to be a panic, otherwise e.g. callers need to check for an error in NonFatal too. I think a panic is better, because it should only ever happen if you do something nasty like escape a Task into a global.

Comment on lines 206 to 209
node := &caller.path
url := q.URL()
fmt.Println(url)
for node.Query != nil {
Member:

Could use a multi-statement for loop to make this a little more concise:

for node := &caller.path; node.Query != nil; node = node.Prev {

Member Author:

Better yet: path.Walk.

Comment on lines 271 to 272
if closed(r.done) {
done()
Member:

I think this would be a little more intuitively expressed with an unconditional close(r.done). I think the only places it gets closed are here and then after q.Execute is called.

Then, to detect panics, a construct like this is effective and a little less code/easier to follow IMO:

didPanic := true
defer func() { /* ... */ }()
r.Value, r.Fatal = q.Execute(callee)
didPanic = false

So the deferred function just inspects didPanic.

Member Author:

I simplified this significantly another way, please take a look at run and runAsync

@@ -204,21 +266,34 @@ func run[T any](caller Task, task *task, q Query[T], wg *semaphore.Weighted, don
}

defer func() {
callee.done = true
caller.exec.sema.Release(1)
Member:

Is it safe to do this before the mutation to the task below (line 277, task.result.Store(nil))?

I think after this is called, the calling task may immediately "wake up" and could potentially observe a partial result before the code below clears it.

Member Author:

not only is callee.done incorrect, I got rid of Task.done altogether.

Comment on lines 22 to 25
// URLBuilder is a helper for building URLs.
//
// It is a simplified version of the interface of [net/url.URL].
type URLBuilder struct {
Member:

This doesn't seem like a real "builder" pattern. Instead, this seems like a URI, not a builder, and the Build() method could just as reasonably be String().

Also, not sure this pays its freight -- it's extra surface area to maintain if/when we make it exported/non-experimental, and it doesn't seem hugely valuable.

Member Author:

Deleted it :)

@mcy mcy requested a review from jhump October 30, 2024 17:30
@jhump jhump left a comment (Member):

I left a few comments, but I think this is really close. So LGTM modulo the few remarks.

return anyQuery{
url: q.URL(),
return &AnyQuery{
key: q.Key(),
Member:

This needs to also set actual: q.

Member Author:

oopies

experimental/incremental/task.go (three outdated, resolved threads)

if wg.Acquire(caller.ctx, 1) != nil {
// run executes a query in the context of some task and writes the result to out.
func (t *task) run(caller Task, q *AnyQuery, done func(*result)) (async bool) {
Member:

I think this and the one below are confusingly named. runAny actually blocks for the result and isn't really async. This is the async version, since it returns immediately and can spawn a goroutine to actually invoke done.

Maybe rename this one to start and the other to run?

@mcy mcy enabled auto-merge (squash) November 11, 2024 17:56
@mcy mcy disabled auto-merge November 11, 2024 17:56
@mcy mcy enabled auto-merge (squash) November 11, 2024 17:57
@mcy mcy merged commit aa4cf26 into main Nov 11, 2024
8 checks passed
@mcy mcy deleted the mcy/query branch November 11, 2024 18:01