Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't glob if glob ain't glob #10502

Closed

Conversation

parsonsmatt
Copy link
Collaborator

@parsonsmatt parsonsmatt commented Nov 1, 2024

This PR modifies the behavior around globs, such that literal filepaths (ie a string that does not contain a glob special character "{,}*") skip the glob logic and are treated as bare filepaths.

Why?

I've identified that the glob logic is quite slow. Locally, it takes about 5-10ms to run through the glob checker on literal filepaths. Meanwhile, doesFileExist is measured in ~200ns or so. This wouldn't be a huge deal on it's own, but we have about 1,800 lines of extra-source-files, and the time spent in Expanding glob on our codebase is about 35 seconds, the vast majority of the 45-50 second lag time between cabal repl --repl-no-load and an empty GHCi prompt being available.

On our codebase, this patch reduces the initial boot time to just 14 seconds.

This patch followed a "type directed" approach, and this resulted in several breaking changes to the API. I can rework this to be a less breaking change, but it felt like the easiest way to ensure I covered all points of behavior.

This PR addresses the primary slowdown experienced in #10495.

Template Α: This PR modifies behaviour or interface

Include the following checklist in your PR:

  • Patches conform to the coding conventions.
  • Any changes that could be relevant to users have been recorded in the changelog.
  • The documentation has been updated, if necessary.
  • Manual QA notes have been included.
  • Tests have been added. (Ask for help if you don’t know how to write them! Ask for an exemption if tests are too complex for too little coverage!)

@geekosaur
Copy link
Collaborator

Just to be clear, you'll have to use cabal master / preleases or wait for 3.16 with a breaking change, since it's probably too late to squeeze this into 3.14 (needs to go out when ghc 9.12.1 does) and API-breaking changes can't be backported to 3.12 or point releases of 3.14.

Copy link
Collaborator

@geekosaur geekosaur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although Fourmolu seems unhappy with it.

@parsonsmatt
Copy link
Collaborator Author

Of course! This patch is meant as a quick proof-of-concept to show perf improvements. We'll likely use a patched cabal for our developers while this is getting merged in; the API changes are kinda ugly and I anticipate that a different approach may be better.

Comment on lines +57 to +61
-- TODO: This is kind of confusing - I feel like someone writing that they
-- expect a file `foo/c.html` in `extra-source-files` should probably
-- get a Problem if the match in question is `foo/c.html/blah` ?
-- , testCase "literal no match on prefix" $
-- testMatches "foo/c.html" (Left "foo/c.html")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open question here still

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems odd, yeah. I'd support making it an error, although it would be really nice if we had a crater-like tool we could use to check the impact of a change like this.

=> [Either FilePath Glob] -- data-files globs.
-> [Either FilePath Glob] -- extra-source-files globs.
-> [Either FilePath Glob] -- extra-doc-files globs.
-> [Either FilePath Glob] -- extra-files globs.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the Either blindness here is a bit of a smell imo, probably better for Glob to have a OopsAllLiterals constructor?

-> CheckM m ()
checkMissingDocs dgs esgs edgs efgs = do
extraDocSupport <- (>= CabalSpecV1_18) <$> asksCM ccSpecVersion

mpackageOps <- asksCM $ ciPackageOps . ccInterface
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels pretty awkward to me - liftInt doesn't give a way to use two interfaces, so I have to smuggle this one in here. Could push the liftInt calls down to using the appropriate interfaces more locally.

Comment on lines +225 to +228
parseFileGlob :: CabalSpecVersion -> FilePath -> Either GlobSyntaxError (Either FilePath Glob)
parseFileGlob _version filepath
| all (`notElem` filepath) "*{}," =
Right (Left filepath)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the core change in this PR. If we don't actually use any glob-like characters, it's not a glob, so we can skip the slow glob treatment.

@@ -1019,7 +1026,23 @@ checkMissingDocs dgs esgs edgs efgs = do
-- 2. Realise Globs.
let realGlob t =
concatMap globMatches
<$> mapM (runDirFileGlobM ops "") t
<$> mapM
( \efpg ->
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
( \efpg ->
( \pathOrGlob ->

Copy link
Collaborator

@9999years 9999years left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. efpg is a bad variable name though, fix that one if you would :)

@parsonsmatt parsonsmatt marked this pull request as ready for review November 4, 2024 15:24
Copy link
Collaborator Author

@parsonsmatt parsonsmatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've marked the PR as ready for review. I have two main questions at this point:

  1. What's the recommendation for a less invasive change to Cabal here, so we can potentially get this back-ported to earlier versions?
  2. There are a few TODOs left in PR - would appreciate some guidance on those.
  3. The Either profileration here feels a little gross - not generally a fan of primitive blindness like that. Would we prefer to see something like data PathOrGlob = PlainFilepath Filepath | ActuallyGlob Glob? Happy to take suggestions on naming here.

pure $
if exist
then [GlobMatch filepath]
else [] -- TODO: probably should signal failure here
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the one hand, we'll have already reported a problem if the filepath literal glob didn't match something previously. This check is trying to see if there are any other notable files present in these globs. So this triggering an error condition is unlikely: we'd need to have, like, extra-source-files: README.md and then README.md not being present in the directory.

I think this is redundant, so it's probably fine to have [] here.

A future improvement will remove the realGlob function from this file and accept the [GlobResult FilePath] directly as arguments, removing the redundant call to runDirFileGlobM.

@parsonsmatt
Copy link
Collaborator Author

I'm going to close in favor of #10518 which should be back-port-able and is a much smaller change actually. Should let me work on the "avoid globbing twice" problem separately

@parsonsmatt parsonsmatt closed this Nov 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants