Fix race condition when resuming an aborted run #5600

Open · wants to merge 2 commits into base: master
@@ -803,16 +803,18 @@ class TaskProcessor {
         while( true ) {
             hash = HashBuilder.defaultHasher().putBytes(hash.asBytes()).putInt(tries).hash()

-            Path resumeDir = null
+            Path workDir = null
             boolean exists = false
             try {
                 final entry = session.cache.getTaskEntry(hash, this)
-                resumeDir = entry ? FileHelper.asPath(entry.trace.getWorkDir()) : null
-                if( resumeDir )
-                    exists = resumeDir.exists()
+                workDir = entry
+                    ? FileHelper.asPath(entry.trace.getWorkDir())
+                    : task.getWorkDirFor(hash)
+                if( workDir )
+                    exists = workDir.exists()

-                log.trace "[${safeTaskName(task)}] Cacheable folder=${resumeDir?.toUriString()} -- exists=$exists; try=$tries; shouldTryCache=$shouldTryCache; entry=$entry"
-                final cached = shouldTryCache && exists && entry.trace.isCompleted() && checkCachedOutput(task.clone(), resumeDir, hash, entry)
+                log.trace "[${safeTaskName(task)}] Cacheable folder=${workDir?.toUriString()} -- exists=$exists; try=$tries; shouldTryCache=$shouldTryCache; entry=$entry"
+                final cached = shouldTryCache && exists && entry && entry.trace.isCompleted() && checkCachedOutput(task.clone(), workDir, hash, entry)
Comment on lines -810 to +817
Member:
This is one of those pieces of code where the fewer the changes the better, both to make it simple to review and to keep the history readable.

Member Author:
I know. This is about as simple as it gets while keeping the intent clear (see the sketch after the list):

  1. check for a cache entry and its work dir, or compute the work dir if there is no cache entry
  2. check whether the work dir exists
  3. check for cached outputs if the cache entry and work dir exist and the task completed

and further down:
4. if the outputs are cached then use them
5. otherwise, if the work dir exists then use a new work dir
6. otherwise, create the work dir and use it
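
For illustration, a minimal Groovy sketch of those six steps folded into one helper. This is hypothetical (the `resolveWorkDir` name and the null-means-retry convention are mine); `session`, `task`, `shouldTryCache`, and `checkCachedOutput` are the names used in the diff above:

```groovy
// Hypothetical sketch of the flow above -- not the actual TaskProcessor code.
// Returns the work dir to use, or null if the caller should retry with a new hash.
Path resolveWorkDir(HashCode hash) {
    // 1. look up the cache entry and its work dir, or compute the work dir
    final entry = session.cache.getTaskEntry(hash, this)
    final workDir = entry
        ? FileHelper.asPath(entry.trace.getWorkDir())
        : task.getWorkDirFor(hash)

    // 2. check whether the work dir exists
    final exists = workDir.exists()

    // 3-4. if the cache entry and work dir exist and the task completed,
    //      check for cached outputs and reuse them
    if( shouldTryCache && exists && entry && entry.trace.isCompleted()
            && checkCachedOutput(task.clone(), workDir, hash, entry) )
        return workDir

    // 5. the directory exists but its outputs are not reusable -- signal a retry
    if( exists )
        return null

    // 6. otherwise create the work dir and use it
    if( !workDir.mkdirs() )
        throw new IOException("Unable to create directory=$workDir -- check file system permissions")
    return workDir
}
```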

Member:
OK, getting better, but why add `entry &&` to the cached condition? If the intent is to apply the same logic when the entry is missing, shouldn't the condition remain the same?

Member Author:
In the previous logic, it wouldn't check for cached outputs if the cache entry was missing, and that is also how it behaves here. The `&& entry` is required to prevent a null reference exception on `entry.trace.isCompleted()`.
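
For clarity: Groovy's `&&` short-circuits left to right, so the guard keeps `entry.trace` from ever being dereferenced on a null entry. A standalone illustration (not PR code):

```groovy
def entry = null                    // no cache entry found

// entry.trace.isCompleted() alone would throw a NullPointerException here;
// with the guard, the right-hand side is never evaluated when entry is null
boolean cached = entry && entry.trace.isCompleted()
assert !cached

// the safe-navigation operator is an equivalent alternative
assert !(entry?.trace?.isCompleted())
```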

                 if( cached )
                     break
             }
@@ -826,11 +828,8 @@ class TaskProcessor {
             }

             final lock = lockManager.acquire(hash)
-            final workDir = task.getWorkDirFor(hash)
             try {
-                if( resumeDir != workDir )
-                    exists = workDir.exists()
-                if( !exists && !workDir.mkdirs() )
+                if( !workDir.mkdirs() )
Comment on lines 830 to +832
Member Author:
> I've been wondering about this lock over the task directory. I think the purpose is to prevent two tasks from using the same directory. But in that case maybe it should be over the previous try-catch block?

It should prevent two tasks from checking the same directory at the same time, because that is how a task determines whether to use the directory or try a different one.

Member:
It may. However, what I'm thinking is that, to solve this issue, we should assume a new task run always uses a newly created directory. Not sure this logic satisfies that.

Member Author:
Currently this lock is useless: if two tasks request the same directory, the lock will serialize the `mkdirs()` call but won't actually cause the second task to move to a different directory.
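
To make the race concrete: if the existence check happens before the lock is taken, both tasks can observe `exists == false` and then merely serialize the `mkdirs()` call. A sketch of a check-and-create that would actually force the second task onto a new directory (illustrative only, not the PR's code; it assumes the lock exposes a `release()` method, which the collapsed `finally` block presumably calls):

```groovy
// Illustrative only. The existence check and the creation happen
// atomically under the lock, so the second task to acquire it sees
// the directory created by the first and retries with a new hash.
final lock = lockManager.acquire(hash)
try {
    if( workDir.exists() ) {
        tries++                 // directory already claimed by another task
        continue                // loop back and derive a new hash/work dir
    }
    if( !workDir.mkdirs() )
        throw new IOException("Unable to create directory=$workDir -- check file system permissions")
}
finally {
    lock.release()
}
```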

                     throw new IOException("Unable to create directory=$workDir -- check file system permissions")
             }
             finally {