Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

10% of parse time is tracking line numbers #107

Open
evmar opened this issue Jan 19, 2024 · 4 comments
Open

10% of parse time is tracking line numbers #107

evmar opened this issue Jan 19, 2024 · 4 comments

Comments

@evmar
Copy link
Owner

evmar commented Jan 19, 2024

We mark each build with its line number for use in error messages, but computing the line number requires keeping track of each newline we encounter.

Some experiments:

  • if I comment out that logic, parsing was 10% faster
  • we also store an Rc on each build to store the filename (error messages look like foo.ninja:123: some error), but that appeared to be ~no impact on perf -- I imagine it's just a trivial increment of a non-atomic refcount for each build statement
  • another experiment I tried was to compute line numbers only when needed: each time we need a line number, scan forwards from the last time we computed a line number and count lines, caching the result. This didn't help.

A trick I learned from LLVM is to instead annotate each build with its file byte offset when parsing, which is faster to gather. Then if we encounter an error we can spend the time to re-parse the file for newlines to find the offset. Parsing is slow but it's like milliseconds slow, it's fine to do in an error message codepath.

Unfortunately doing this would mean we need to either keep around the build file text at runtime, or re-read the file when generating an error message.

If we kept the file text at runtime:

  • the text is like tens of megabytes for the biggest build I currently have (LLVM CMake) but I believe the Android build file text might be much larger(?). Maybe it's fine to waste some memory on it, but it feels pretty wasteful...
    -there's a couple places where we could avoid allocating a copy of some strings found within the file, though I think those are pretty minor.

If we re-read the file when generating error messages:

  • maybe this is fine? I didn't investigate, feels like kind of a rabbit hole
@Colecf
Copy link
Contributor

Colecf commented Jan 19, 2024

I believe the Android build file text might be much larger(?)

Some numbers:

aosp-main$ du -h out/build-sdk_phone64_x86_64.ninja 
935M	out/build-sdk_phone64_x86_64.ninja
aosp-main$ du -h out/soong/build.sdk_phone64_x86_64.ninja
2.3G	out/soong/build.sdk_phone64_x86_64.ninja
internal-main$ du -h out/build-sdk_phone64_x86_64.ninja 
1.3G	out/build-sdk_phone64_x86_64.ninja
internal-main$ du -h out/soong/build.sdk_phone64_x86_64.ninja
3.9G	out/soong/build.sdk_phone64_x86_64.ninja

So yeah, it would be quite a bit of memory used just for line numbers...

@evmar
Copy link
Owner Author

evmar commented Jan 21, 2024

I was about to say that's a convincing argument for not keeping the files in memory, but I guess if we mmap'd the files the memory use of keeping them around is "free", in that the memory backing of the mmap could be paged out by the OS under memory pressure... I think(?)

(I had thought that we only used line numbers in error messages, which would mean we'd only ever need to reread a file once. But looking now we also use them in -d explain mode to print which builds are out of date...)

Just to write it down, the full LLVM trick is: map each input file to an integer (e.g. put them in a Vec), and then pack that integer and the file offset together. So you only need to store a single integer per thing you want to annotate with a file location. This matters more for LLVM because I think they store the file location of every AST node (for compiler diagnostics etc.)

In n2's case we only store one of these per build statement. Currently it's a not especially efficient Rc+usize so 16 bytes per Build in memory. For a big Android build that's probably only still costing a few megabytes so probably not worth worrying about too much.

@evmar
Copy link
Owner Author

evmar commented Jan 21, 2024

Note to self, to find all the places we use line numbers, the trick is to comment out the impl std::fmt::Display for FileLoc and see where the errors are.

@Colecf
Copy link
Contributor

Colecf commented Feb 28, 2024

After #112, I tried disabling the line counting and wasn't able to observe a noticeable speedup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants