Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of finding indexables #2082

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

natematykiewicz
Copy link
Contributor

@natematykiewicz natematykiewicz commented May 22, 2024

Currently, all folders and files in the current tree are turned into IndexablePath, and then excluded files are filtered out after.

When there are large file trees that are meant to be excluded, this results in a lot of unnecessary work.

ActiveStorage stores files in the tmp directory in many many small folders. So does Bootsnap. Ruby LSP has to traverse all of these files, even though the entire directory should just be ignored.

Rubocop has solved this by breaking the includes patterns up into many patterns, applying the exclusions before the Dir.glob, so I followed in their footsteps. This works great for exclusions that end in "**/*". We still need to loop through all IndexablePath objects and see if they're excluded, in the case that an extension was provided on the excluded path, but this can cut down load time dramatically.

Before this PR in my Rails app, indexables took 76 seconds to run. Now it takes, 0.19 seconds. Before and after code both return the same exact file list.

Additionally, I added node_modules to the list of excluded trees, since that can be very large and never includes Ruby files.

I also removed the *.rb from the bundler path. Having a file extension on that means we need to scan all files. But we simply want to ignore the entire bundler path tree. I kind of wonder if we should always replace *.rb with *, to help people improve performance. Excluding an actual file name or partial file name makes sense. But excluding the only file extension we scan means we can do it faster by excluding the whole folder.

Motivation

Opening a ruby file caused my LSP server to print "Ruby LSP: indexing files" for 76 seconds at 0% before the progress bar starts moving.

Implementation

I knew that Rubocop has solved this problem before, so I looked at this file and followed what they did.

Automated Tests

I added tests for the new pattern exclude_pattern that gets used with fnmatch, while ensuring I didn't break any existing tests.

Manual Tests

I made a file that has both implementations of indexables on my computer. Then ran this in the Rails Console:

Benchmark.measure { RubyIndexer::Configuration.new.indexables }
=>
#<Benchmark::Tms:0x000000012a2dead0
 @cstime=0.0,
 @cutime=0.0,
 @label="",
 @real=0.19083199999295175,
 @stime=0.0784180000000001,
 @total=0.18055299999999974,
 @utime=0.10213499999999964>


Benchmark.measure { RubyIndexer::Configuration.new.old_indexables }
=>
#<Benchmark::Tms:0x000000012a374b20
 @cstime=0.0,
 @cutime=0.0,
 @label="",
 @real=76.83584400010295,
 @stime=13.666566,
 @total=14.943883,
 @utime=1.277317>

old_indexables = RubyIndexer::Configuration.new.old_indexables
indexables = RubyIndexer::Configuration.new.indexables

indexables.size == old_indexables.size
=> true

indexables.map(&:full_path).sort == old_indexables.map(&:full_path).sort
=> true

indexables.size
=> 13933

So here you can see that it finds the same ~14k files in 0.25% of the time.

Currently, all folders and files in the current tree are turned into
IndexablePath, and then excluded files are filtered out after.

When there are large file trees that are meant to be excluded, this
results in a lot of unnecessary work.

ActiveStorage stores files in the `tmp` directory in many many small
folders. So does Bootsnap. Ruby LSP has to traverse all of these files,
even though the entire directory should just be ignored.

Rubocop has solved this by breaking the `includes` patterns up into many
patterns, applying the exclusions *before* the `Dir.glob`, so I followed
in their footsteps. This works great for exclusions that end in "**/*".
We still need to loop through all IndexablePath objects and see if
they're excluded, in the case that an extension was provided on the
excluded path, but this can cut down load time dramatically.

Before this PR in my Rails app, `indexables` took 76 seconds to run. Now
it takes, 0.17 seconds. Before and after code both return the same exact
file list.
@natematykiewicz
Copy link
Contributor Author

I just resolved the merge conflict

@andyw8 andyw8 added enhancement New feature or request server This pull request should be included in the server gem's release notes labels May 23, 2024
Copy link
Member

@vinistock vinistock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvements 🚀

lib/ruby_indexer/lib/ruby_indexer/configuration.rb Outdated Show resolved Hide resolved
lib/ruby_indexer/lib/ruby_indexer/configuration.rb Outdated Show resolved Hide resolved
lib/ruby_indexer/lib/ruby_indexer/configuration.rb Outdated Show resolved Hide resolved
lib/ruby_indexer/lib/ruby_indexer/configuration.rb Outdated Show resolved Hide resolved
@natematykiewicz
Copy link
Contributor Author

@vinistock I believe I've addressed all of your comments

@natematykiewicz
Copy link
Contributor Author

@vinistock do you need anything else from me on this?

lib/ruby_indexer/lib/ruby_indexer/configuration.rb Outdated Show resolved Hide resolved
lib/ruby_indexer/lib/ruby_indexer/configuration.rb Outdated Show resolved Hide resolved
Comment on lines 176 to 178
relative_patterns = @excluded_patterns
.select { |p| p.end_with?("/**/*") }
.map { |p| p.delete_prefix("#{Dir.pwd}/") }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is assuming that every pattern provided by the user is inside Dir.pwd, but that may not be true and I don't think we should tie ourselves to that yet.

I think we would need to change the check here to verify that the pattern both ends with wildcards and starts with Dir.pwd or **.

Suggested change
relative_patterns = @excluded_patterns
.select { |p| p.end_with?("/**/*") }
.map { |p| p.delete_prefix("#{Dir.pwd}/") }
relative_patterns = @excluded_patterns
.select { |p| p.end_with?("/**/*") && (p.start_with?(Dir.pwd) || p.start_with?("**")) }
.each { |p| p.delete_prefix!("#{Dir.pwd}/") }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, though I don't think we want the ! on delete_prefix. We're allocating a new array, but the strings are still the ones in @excluded_patterns, which means we'd be mutating the content of @excluded_patterns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added that change to the select line, but kept what I had on the map line, since I don't think we want to mutate @excluded_patterns' strings. Let me know if you think otherwise.
3a8c7fe

Copy link
Contributor Author

@natematykiewicz natematykiewicz Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, included_patterns should be relative to Dir.pwd, right? apply_config doesn't automatically prefix it with Dir.pwd, and I don't see how users would realistically set Dir.pwd in the YAML, considering they'd all have the repo checked out to different folders. Feels like we need to be prepending Dir.pwd to every entry in included_patterns. It also feels like excluded_patterns should similarly be expected to be within Dir.pwd. I'm struggling to understand what files someone would be including / excluding that aren't in a repo or a gem. But from what I can tell, excluded_patterns refers solely to application code, not gems. You exclude entire gems, not paths within a gem.

Right now I'm feeling like include_patterns doesn't even work, unless the user provides an absolute path to the folder on disk, which isn't very portable. What's the use-case for this? Understanding will help me figure out how to best optimize this, because the goal is to avoid calling Dir.glob on folders that are excluded, which means we need to take the included patterns, apply the exclude patterns that end with wildcards against them (essentially making include_patterns a larger list of folders), to cut down on which folders Dir.glob scans.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good point. Now that I think of it, all patterns have to be relative to something, since you can clone a project in any directory.

I'll bring this up with the team, but I think we should indeed make everything relative to Dir.pwd.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay great, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're including File.join(Dir.pwd, "**", "*.rb") by default, all I can really think that people might be including is non-.rb ruby files, like a Gemfile or Guardfile. So that leaves me really wondering what people even use include_patterns for. When people specify include_patterns, it doesn't seem to remove the default pattern. So anything they're including is likely already in the existing pattern, unless it doesn't end with .rb or is outside of Dir.pwd.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rubocop's Include config option is relative to Dir.pwd. And it just is meant for adding things like Gemfile or config.ru.

https://docs.rubocop.org/rubocop/configuration.html#unusual-files-that-would-not-be-included-by-default

This says that Rubocop by default:

Finds all Ruby source files under the current or other supplied directory. A Ruby source file is defined as a file with the .rb extension or a file with no extension that has a ruby shebang line as its first line.

https://github.com/rubocop/rubocop/blob/aee64e2d5d6615ef64803af52c1b82bf95b566ef/lib/rubocop/target_finder.rb#L33-L35

It accepts a directory because you can run the Rubocop CLI against a path relative to Dir.pwd (for example, just linting your models).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @vinistock, where'd you guys land on this? Should I assume that all paths are relative to Dir.pwd? If so, should I update apply_config, so that include_patterns are prefixed with Dir.pwd?

lib/ruby_indexer/lib/ruby_indexer/configuration.rb Outdated Show resolved Hide resolved
end

sig { params(base_directory: String).returns(T::Array[String]) }
def indexable_included_file_patterns(base_directory)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is quite complex. Can we add some comments explaining the steps that this takes and include an example of a pattern before and after?

Comment on lines +62 to +63
patterns = indexable_included_file_patterns(Dir.pwd, merge_exclude_patterns).map! { |dir| File.join(dir, "*.rb") }
patterns = [File.join(Dir.pwd, "**/*.rb")] if patterns.empty?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm realizing I've totally forgotten about @included_patterns and just treated it as hardcoded.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. We accept users defining extra glob patterns to index.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request server This pull request should be included in the server gem's release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants