
Clone from local cache when possible #1238

Closed
ysangkok wants to merge 2 commits

Conversation

@ysangkok (Contributor)

Description of the change

Fixes #1208.

Checklist:

  • Added the change to the "Unreleased" section of the changelog
  • Added some example of the new feature to the README
  • Added a test for the contribution (if applicable)

@@ -286,7 +291,7 @@ writeNewLockfile reason allTransitiveDeps = do
processPackage :: LockfileBuilderResult -> Tuple PackageName (Tuple PackageName Package) -> Spago (FetchEnv a) LockfileBuilderResult
processPackage result (Tuple workspacePackageName (Tuple dependencyName dependencyPackage)) = do
  (packageDependencies :: Array PackageName) <- (Array.fromFoldable <<< Map.keys <<< fromMaybe Map.empty)
-   <$> getPackageDependencies dependencyName dependencyPackage
+   <$> getPackageDependencies dependencyName dependencyPackage Map.empty
@ysangkok (Contributor Author)

It seems feasible to populate the map here, but OTOH it doesn't seem like it would save much time since the lock file is generated rarely.

To avoid unnecessary network traffic and unlocking of ssh keys,
let's only run the rebase when the repo is not freshly cloned.

We still have fetch/checkout/rebase split up to retain the existing behaviour
where users can have e.g. unstaged changes that are autostashed.
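
For illustration, a minimal PureScript sketch of the conditional rebase this commit message describes; `updateGitRepo`, `runGit`, and the `freshlyCloned` flag are hypothetical stand-ins, not Spago's actual internals:

```purescript
module ConditionalRebase where

import Prelude

import Effect.Aff (Aff)

-- `runGit` is a hypothetical runner for the git CLI, passed in so the
-- sketch stays self-contained; it is not Spago's actual helper.
updateGitRepo
  :: (Array String -> Aff Unit)
  -> { freshlyCloned :: Boolean, ref :: String }
  -> Aff Unit
updateGitRepo runGit { freshlyCloned, ref } = do
  -- fetch and checkout stay split up so that local unstaged changes
  -- keep working the way they do today
  runGit [ "fetch", "origin" ]
  runGit [ "checkout", ref ]
  -- the rebase touches the network and may unlock ssh keys, so only
  -- run it when the repo was reused rather than freshly cloned
  unless freshlyCloned do
    runGit [ "rebase", "--autostash", ref ]
```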
@ysangkok (Contributor Author)

I pushed a commit to improve efficiency when an existing repo is updated. While developing this commit, I had some concerns:

  • In the case that you are switching to an unrelated branch and you have local
    unstaged changes, the checkout will fail but the rebase will still run. Spago
    will autostash, which IMO is a bit confusing, because you might want to start
    fresh if you switch to an unrelated branch. But I suppose this isn't too bad,
    since support for local unstaged changes is a really experimental feature?

  • In the case that you are switching to a different branch and you have local
    commits, they will be left behind because the checkout will just succeed. I
    suppose this is OK too, since the user shouldn't rely on their custom commits
    surviving a branch change. But it seems inconsistent with the autostashing.

  • In the case where the user is using a short-hash which becomes ambiguous, I
    am relying on rebase to fail with a message like

    % git rebase b756
    error: short object ID b756 is ambiguous
    hint: The candidates are:
    hint:   b7565f941 commit 2012-09-29 - Extend the unpack command for the .cabal file updating
    hint:   b75671b0b commit 2024-03-11 - Merge pull request #9792 from haskell/mergify/bp/3.12/pr-9768
    fatal: invalid upstream 'b756'
    
  • In the case where the user is 'reverting' an extra-dep using a short hash, it
    will not actually switch to the new commit if there are unstaged changes. The
    result is that the new commit is passed to rebase, which will put the 'new'
    (undesired) commits on the old (desired) commit. This seems to be existing
    behaviour, so I am not changing it. And it shouldn't affect many users, since
    it relies on there being unstaged changes.

@f-f (Member) commented on Jun 28, 2024

My understanding of this patch is that we'd only use the cache if we're cloning the same repo multiple times during the current execution, since we're using a ref and don't write it to disk.
I think this is subpar, as we'd re-clone it if you e.g. use the same repo for a different project. The alternative I propose would be to put the repo in the global cache, and copy it over if we find that we are cloning the same ref.
This would be strictly better for tags and commits, but would require some care for branches, because we wouldn't be able to cache those.
We wouldn't need the checkout logic either I think?

The flow would be (see the sketch after this list):

  • if we need to clone a repo at a certain ref, look in the global cache, see if we have the same ref cached
  • if it's not cached, then clone it in a temp dir
    • if the ref is a branch (which we can figure out with a git command), then we can't cache, and we'd move it to the .spago dir
    • if the ref is a tag or a commit, then we'd move it to the global cache, and then copy it over to the .spago dir
  • if it's cached, we just copy it over to the .spago dir

Note: we'd end up cloning potentially multiple copies of a repo, at different refs. This is fine because we'd avoid doing a full clone anyways, and it spares us the checkout logic.
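
A PureScript sketch of this flow, under stated assumptions: every helper in the `GitOps` record below is a hypothetical name invented for illustration, not Spago's actual API.

```purescript
module CloneFlow where

import Prelude

import Effect.Aff (Aff)

type FilePath = String

-- The operations are passed in as a record so the sketch stays
-- self-contained; all field names here are hypothetical.
type GitOps =
  { exists :: FilePath -> Aff Boolean -- e.g. backed by an FS.exists check
  , isBranch :: String -> Aff Boolean -- does this ref name a branch?
  , cloneAt :: String -> FilePath -> Aff Unit -- clone the repo at a ref into a dir
  , copyTree :: FilePath -> FilePath -> Aff Unit
  , mkTemp :: Aff FilePath
  }

-- The flow from the comment above: check the global cache, clone into a
-- temp dir on a miss, and only cache refs that are not branches.
fetchRef :: GitOps -> { globalCacheDir :: FilePath, ref :: String, dest :: FilePath } -> Aff Unit
fetchRef ops { globalCacheDir, ref, dest } = do
  let cached = globalCacheDir <> "/" <> ref
  hit <- ops.exists cached
  if hit then
    -- cached: just copy it over to the .spago dir
    ops.copyTree cached dest
  else do
    tmp <- ops.mkTemp
    ops.cloneAt ref tmp
    branch <- ops.isBranch ref
    if branch then
      -- a branch can move, so we can't cache it
      ops.copyTree tmp dest
    else do
      -- tags and commits are immutable: move to the global cache, then copy over
      ops.copyTree tmp cached
      ops.copyTree cached dest
```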

@ysangkok (Contributor Author)

I think the possibility of specifying a branch is making everything much more complicated, especially because of the autostash and rebase flags that the codebase currently contains, which make it look like rogue local commits and changes should survive. Is that a feature we really have to retain?

I think it would be a lot simpler if we started out only handling the case where a full 40 character SHA ref is specified. What would you think if I developed a patch for only that strategy, using your proposal with the global cache? Then it should be easier to test, since we no longer have to handle the case of the upstream branch moving. And we wouldn't handle autostash and rebasing either.

I am also a bit confused about all the places where getPackageDependencies is used. If we went with this reduced version of the feature, would it be compatible with the current interface of getPackageDependencies? That would make things a lot easier.

Also, would you propose having the global cache filled in a separate IO action, and then later functions would rely on that cache being filled? I tried such an approach in this commit, but it was buggy, and the call graph was confusing to me: I couldn't figure out the invariants of where the cache was populated and where it wasn't. And I don't really know how we'd make an API that somehow aligns the global cache with the type system, which would be nice.

@ysangkok (Contributor Author)

If we only supported refs as 40-character SHAs, effectively all the code could be contained in fetchRepo. It would then look in the global cache, and always copy to the destination directory when passed a SHA ref that exists in the cache.
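
For concreteness, a minimal sketch of a guard for that reduced strategy; `isFullSha` is a hypothetical helper, not something that exists in Spago today:

```purescript
module ShaRef where

import Prelude

import Data.Foldable (all)
import Data.String as String
import Data.String.CodeUnits (toCharArray)

-- Only treat a ref as eligible for the simplified caching strategy when
-- it is a full 40-character lowercase hex SHA.
isFullSha :: String -> Boolean
isFullSha ref = String.length ref == 40 && all isHex (toCharArray ref)
  where
  isHex c = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')
```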

@f-f (Member) commented on Jun 28, 2024

> I think the possibility of specifying a branch is making everything much more complicated, especially because of the autostash and rebase flags that the codebase currently contains, which make it look like rogue local commits and changes should survive. Is that a feature we really have to retain?

I don't understand how the autostash and rebase flags are causing issues here - Spago's job ends when the source ends up in the right place in the .spago folder (i.e. the local cache).
Whatever happens after that (i.e. people editing the repos once they are sitting in the local cache) is the user's business. To my understanding that feature/flow is used quite a lot.
We should be cloning the repo exactly once though, so again, I am missing a concrete scenario in which these flags are problematic.

> I think it would be a lot simpler if we started out only handling the case where a full 40 character SHA ref is specified. What would you think if I developed a patch for only that strategy, using your proposal with the global cache?

People will want to use tags - it's good UX to be able to see which tag one is using. Why is supporting branches and tags an issue? We can keep tags in the global cache, and we can figure out that something is a branch (and not cache it) once we clone it.

> Also, would you propose having the global cache filled in a separate IO action, and then later functions would rely on that cache being filled?

What do you mean by that? It's easy to check if something is in the global cache, a FS.exists will do. Or are you referring to the thundering herd problem?
We use an AVar for that when fetching the Registry repos:

-- The Box AVar will be empty until the first time we fetch the Registry, then
-- we can just use the value that is cached.
-- The Lock AVar is used to make sure
-- that only one fiber is fetching the Registry at a time, and that all the other
-- fibers will wait for it to finish and then use the cached value.
{ db } <- ask
liftAff $ AVar.take registryLock
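
For reference, a compact PureScript sketch of that Box/Lock AVar pattern, assuming the purescript-avar package; the function name and record shape are a hypothetical distillation of the comment above, not Spago's actual code:

```purescript
module FetchOnce where

import Prelude

import Data.Maybe (Maybe(..))
import Effect.Aff (Aff)
import Effect.Aff.AVar (AVar)
import Effect.Aff.AVar as AVar

-- Run `fetch` at most once across many fibers: the Lock AVar serializes
-- access, and the Box AVar memoizes the result for later callers.
fetchOnce :: forall a. { lock :: AVar Unit, box :: AVar a } -> Aff a -> Aff a
fetchOnce { lock, box } fetch = do
  -- only one fiber may be past this point at a time
  AVar.take lock
  result <- AVar.tryRead box >>= case _ of
    Just cached -> pure cached -- someone already fetched: reuse the value
    Nothing -> do
      value <- fetch
      AVar.put value box -- fill the box for every later caller
      pure value
  -- release the lock so waiting fibers can proceed
  -- (a production version would release it in a `finally`)
  AVar.put unit lock
  pure result
```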

> I am also a bit confused about all the places where getPackageDependencies is used.

Why is that an issue? This patch should pretty much be self-contained inside getGitPackageInLocalCache, which getPackageDependencies calls. No details should leak outside of getGitPackageInLocalCache: in there we'd check if the package is in the global cache, and if it's not then we clone it and eventually copy it over there! (as I detailed in my previous comment)

ysangkok mentioned this pull request on Jun 30, 2024
@ysangkok (Contributor Author)

Thanks for your comments, I tried to address them in #1240. So I am closing this.

ysangkok closed this on Jun 30, 2024
ysangkok deleted the clone-from-local-cache branch on June 30, 2024 at 20:21
Linked issue (#1208): spago install git clones for same monorepo package multiple times