
Clone from local cache when possible #1238

Closed
ysangkok wants to merge 2 commits

Conversation

@ysangkok (Contributor)

Description of the change

Fixes #1208.

Checklist:

  • Added the change to the "Unreleased" section of the changelog
  • Added some example of the new feature to the README
  • Added a test for the contribution (if applicable)

@@ -286,7 +291,7 @@ writeNewLockfile reason allTransitiveDeps = do
processPackage :: LockfileBuilderResult -> Tuple PackageName (Tuple PackageName Package) -> Spago (FetchEnv a) LockfileBuilderResult
processPackage result (Tuple workspacePackageName (Tuple dependencyName dependencyPackage)) = do
  (packageDependencies :: Array PackageName) <- (Array.fromFoldable <<< Map.keys <<< fromMaybe Map.empty)
-   <$> getPackageDependencies dependencyName dependencyPackage
+   <$> getPackageDependencies dependencyName dependencyPackage Map.empty
@ysangkok (Contributor Author)

It seems feasible to populate the map here, but OTOH it doesn't seem like it would save much time since the lock file is generated rarely.

To avoid unnecessary network traffic and unlocking of ssh keys,
let's only run the rebase when the repo is not freshly cloned.

We still have fetch/checkout/rebase split up to retain the existing behaviour
where users can have e.g. unstaged changes that are autostashed.
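
For illustration, a minimal PureScript sketch of the conditional rebase this commit message describes; `updateGitRepo`, `runGit`, and the `freshlyCloned` flag are hypothetical stand-ins, not Spago's actual internals:

```purescript
module ConditionalRebase where

import Prelude

import Effect.Aff (Aff)

-- `runGit` is a hypothetical runner for the git CLI, passed in so the
-- sketch stays self-contained; it is not Spago's actual helper.
updateGitRepo
  :: (Array String -> Aff Unit)
  -> { freshlyCloned :: Boolean, ref :: String }
  -> Aff Unit
updateGitRepo runGit { freshlyCloned, ref } = do
  -- fetch and checkout stay split up so that local unstaged changes
  -- keep working the way they do today
  runGit [ "fetch", "origin" ]
  runGit [ "checkout", ref ]
  -- the rebase touches the network and may unlock ssh keys, so only
  -- run it when the repo was reused rather than freshly cloned
  unless freshlyCloned do
    runGit [ "rebase", "--autostash", ref ]
```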
@ysangkok (Contributor Author)

I pushed a commit to improve efficiency when an existing repo is updated. While developing this commit, I had some concerns:

  • In the case that you are switching to an unrelated branch and you have local
    unstaged changes, the checkout will fail but the rebase will still run. Spago
    will autostash, which IMO is a bit confusing, because you might want to start
    fresh if you switch to an unrelated branch. But I suppose this isn't too bad,
    since support for local unstaged changes is a really experimental feature?

  • In the case that you are switching to a different branch and you have local
    commits, they will be left behind because the checkout will just succeed. I
    suppose this is OK too, since the user shouldn't rely on their custom commits
    surviving a branch change. But it seems inconsistent with the autostashing.

  • In the case where the user is using a short-hash which becomes ambiguous, I
    am relying on rebase to fail with a message like

    % git rebase b756
    error: short object ID b756 is ambiguous
    hint: The candidates are:
    hint:   b7565f941 commit 2012-09-29 - Extend the unpack command for the .cabal file updating
    hint:   b75671b0b commit 2024-03-11 - Merge pull request #9792 from haskell/mergify/bp/3.12/pr-9768
    fatal: invalid upstream 'b756'
    
  • In the case where the user is 'reverting' an extra-dep using a short hash, it
    will not actually switch to the new commit if there are unstaged changes. The
    result is that the new commit is passed to rebase, which will put the 'new'
    (undesired) commits on the old (desired) commit. This seems to be existing
    behaviour, so I am not changing it. And it shouldn't affect many users, since
    it relies on there being unstaged changes.

@f-f (Member) commented on Jun 28, 2024

My understanding of this patch is that we'd only use the cache if we're cloning the same repo multiple times during the current execution, since we're using a ref and don't write it to disk.
I think this is subpar, as we'd re-clone it if you e.g. use the same repo for a different project. The alternative I propose would be to put the repo in the global cache, and copy it over if we find that we are cloning the same ref.
This would be strictly better for tags and commits, but would require some care for branches, because we wouldn't be able to cache those.
We wouldn't need the checkout logic either I think?

The flow would be (see the sketch after this list):

  • if we need to clone a repo at a certain ref, look in the global cache, see if we have the same ref cached
  • if it's not cached, then clone it in a temp dir
    • if the ref is a branch (which we can figure out with a git command), then we can't cache, and we'd move it to the .spago dir
    • if the ref is a tag or a commit, then we'd move it to the global cache, and then copy it over to the .spago dir
  • if it's cached, we just copy it over to the .spago dir

Note: we'd end up cloning potentially multiple copies of a repo, at different refs. This is fine because we'd avoid doing a full clone anyways, and it spares us the checkout logic.
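
A PureScript sketch of this flow, under stated assumptions: every helper in the `GitOps` record below is a hypothetical name invented for illustration, not Spago's actual API.

```purescript
module CloneFlow where

import Prelude

import Effect.Aff (Aff)

type FilePath = String

-- The operations are passed in as a record so the sketch stays
-- self-contained; all field names here are hypothetical.
type GitOps =
  { exists :: FilePath -> Aff Boolean -- e.g. backed by an FS.exists check
  , isBranch :: String -> Aff Boolean -- does this ref name a branch?
  , cloneAt :: String -> FilePath -> Aff Unit -- clone the repo at a ref into a dir
  , copyTree :: FilePath -> FilePath -> Aff Unit
  , mkTemp :: Aff FilePath
  }

-- The flow from the comment above: check the global cache, clone into a
-- temp dir on a miss, and only cache refs that are not branches.
fetchRef :: GitOps -> { globalCacheDir :: FilePath, ref :: String, dest :: FilePath } -> Aff Unit
fetchRef ops { globalCacheDir, ref, dest } = do
  let cached = globalCacheDir <> "/" <> ref
  hit <- ops.exists cached
  if hit then
    -- cached: just copy it over to the .spago dir
    ops.copyTree cached dest
  else do
    tmp <- ops.mkTemp
    ops.cloneAt ref tmp
    branch <- ops.isBranch ref
    if branch then
      -- a branch can move, so we can't cache it
      ops.copyTree tmp dest
    else do
      -- tags and commits are immutable: move to the global cache, then copy over
      ops.copyTree tmp cached
      ops.copyTree cached dest
```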

@ysangkok (Contributor Author)

I think the possibility of specifying a branch is making everything much more complicated, especially because of the autostash and rebase flags that the codebase currently contains, which make it look like rogue local commits and changes should survive. Is that a feature we really have to retain?

I think it would be a lot simpler if we started out only handling the case where a full 40 character SHA ref is specified. What would you think if I developed a patch for only that strategy, using your proposal with the global cache? Then it should be easier to test, since we no longer have to handle the case of the upstream branch moving. And we wouldn't handle autostash and rebasing either.

I am also a bit confused about all the places where getPackageDependencies is used. If we went with this reduced version of the feature, would it be compatible with the current interface of getPackageDependencies? That would make things a lot easier.

Also, would you propose having the global cache filled in a separate IO action, and then later functions would rely on that cache being filled? I tried such an approach in this commit, but it was buggy, and the call graph was confusing to me: I couldn't figure out the invariants of where the cache was populated and where it wasn't. And I don't really know how we'd make an API that somehow aligns the global cache with the type system, which would be nice.

@ysangkok (Contributor Author)

If we only supported refs as 40-character SHAs, effectively all the code could be contained in fetchRepo. It would then look in the global cache, and always copy to the destination directory when passed a SHA ref that exists in the cache.
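
For concreteness, a minimal sketch of a guard for that reduced strategy; `isFullSha` is a hypothetical helper, not something that exists in Spago today:

```purescript
module ShaRef where

import Prelude

import Data.Foldable (all)
import Data.String as String
import Data.String.CodeUnits (toCharArray)

-- Only treat a ref as eligible for the simplified caching strategy when
-- it is a full 40-character lowercase hex SHA.
isFullSha :: String -> Boolean
isFullSha ref = String.length ref == 40 && all isHex (toCharArray ref)
  where
  isHex c = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')
```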

@f-f (Member) commented on Jun 28, 2024

> I think the possibility of specifying a branch is making everything much more complicated, especially because of the autostash and rebase flags that the codebase currently contains, which make it look like rogue local commits and changes should survive. Is that a feature we really have to retain?

I don't understand how the autostash and rebase flags are causing issues here - Spago's job ends when the source ends up in the right place in the .spago folder (i.e. the local cache).
Whatever happens after that (i.e. people editing the repos once they are sitting in the local cache) is the user's business. To my understanding that feature/flow is used quite a lot.
We should be cloning the repo exactly once though, so again, I am missing a concrete scenario in which these flags are problematic.

> I think it would be a lot simpler if we started out only handling the case where a full 40 character SHA ref is specified. What would you think if I developed a patch for only that strategy, using your proposal with the global cache?

People will want to use tags - it's good UX to be able to see which tag one is using. Why is supporting branches and tags an issue? We can keep tags in the global cache, and we can figure out that something is a branch (and not cache it) once we clone it.

> Also, would you propose having the global cache filled in a separate IO action, and then later functions would rely on that cache being filled?

What do you mean by that? It's easy to check if something is in the global cache, a FS.exists will do. Or are you referring to the thundering herd problem?
We use an AVar for that when fetching the Registry repos:

-- The Box AVar will be empty until the first time we fetch the Registry, then
-- we can just use the value that is cached.
-- The Lock AVar is used to make sure
-- that only one fiber is fetching the Registry at a time, and that all the other
-- fibers will wait for it to finish and then use the cached value.
{ db } <- ask
liftAff $ AVar.take registryLock
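
For reference, a compact PureScript sketch of that Box/Lock AVar pattern, assuming the purescript-avar package; the function name and record shape are a hypothetical distillation of the comment above, not Spago's actual code:

```purescript
module FetchOnce where

import Prelude

import Data.Maybe (Maybe(..))
import Effect.Aff (Aff)
import Effect.Aff.AVar (AVar)
import Effect.Aff.AVar as AVar

-- Run `fetch` at most once across many fibers: the Lock AVar serializes
-- access, and the Box AVar memoizes the result for later callers.
fetchOnce :: forall a. { lock :: AVar Unit, box :: AVar a } -> Aff a -> Aff a
fetchOnce { lock, box } fetch = do
  -- only one fiber may be past this point at a time
  AVar.take lock
  result <- AVar.tryRead box >>= case _ of
    Just cached -> pure cached -- someone already fetched: reuse the value
    Nothing -> do
      value <- fetch
      AVar.put value box -- fill the box for every later caller
      pure value
  -- release the lock so waiting fibers can proceed
  -- (a production version would release it in a `finally`)
  AVar.put unit lock
  pure result
```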

> I am also a bit confused about all the places where getPackageDependencies is used.

Why is that an issue? This patch should pretty much be self-contained inside getGitPackageInLocalCache, which getPackageDependencies calls. No details should leak outside of getGitPackageInLocalCache: in there we'd check if the package is in the global cache, and if it's not then we clone it and eventually copy it over there! (as I detailed in my previous comment)

ysangkok mentioned this pull request on Jun 30, 2024
@ysangkok (Contributor Author)

Thanks for your comments, I tried to address them in #1240. So I am closing this.

ysangkok closed this on Jun 30, 2024
ysangkok deleted the clone-from-local-cache branch on June 30, 2024 at 20:21
Linked issue (#1208): spago install git clones for same monorepo package multiple times