
Misleading failure caching #144

Open
wizeman opened this issue Jan 24, 2014 · 0 comments
wizeman commented Jan 24, 2014

Please take the following with a grain of salt, as I am new to using Hydra.

If I do a commit N to add a new derivation in nixpkgs (or upgrade an existing one) and put down the correct hash in fetchurl, but a bad source URL, then any jobs that depend on this derivation will fail (because the source file cannot be downloaded).

However, if I fix the URL in commit N+1, then Hydra will not actually restart the job, and instead it will say that the build is still failing, with the same error as before.

From a user perspective this is misleading because you can clearly see that the input to the newest jobset evaluation is the new commit N+1, but the build will still reflect the results from commit N.
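For concreteness, the situation arises with a fixed-output fetchurl call like the hypothetical one below (the URL and hash here are made up for illustration). Because the output is pinned by its hash, correcting the URL in commit N+1 yields the exact same derivation output path as commit N:

    # Hypothetical commit N: the sha256 is correct, but the URL has a typo
    # ("helo" instead of "hello"). Commit N+1 fixes only the URL, which
    # changes neither the inputs nor the output path of the derivation.
    src = fetchurl {
      url = "http://example.org/helo-2.9.tar.gz";
      sha256 = "0000000000000000000000000000000000000000000000000000";
    };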

In fact, even if you "clear failed builds cache" and "clear VCS caches" and then push a new unrelated commit (to force a new jobset evaluation), the failure will still persist, which is even more confusing. And if you read the nix.conf man page, it says the following:

build-cache-failures (...) Failures in fixed-output derivations (such as fetchurl calls) are never cached. (...)

... which is even more misleading if you're trying to understand the problem, given that it is a fetchurl failure, yet it appears to have become a cached failure.

As far as I can see, the problem seems to be in hydra-evaluator, specifically in src/lib/Hydra/Helper/AddBuilds.pm at line 492, in the function checkBuild():

    492         # Don't add a build that has already been scheduled for this
    493         # job, or has been built but is still a "current" build for
    494         # this job.  (...)
(...)
    503         if (defined $prevEval) {
    504             # Only check one output: if it's the same, the other will be as well.
    505             my $firstOutputName = $outputNames[0];
    506             my ($prevBuild) = $prevEval->builds->search(
    507                 # The "project" and "jobset" constraints are
    508                 # semantically unnecessary (because they're implied by
    509                 # the eval), but they give a factor 1000 speedup on
    510                 # the Nixpkgs jobset with PostgreSQL.
    511                 { project => $jobset->project->name, jobset => $jobset->name, job => $jobName,
    512                   name => $firstOutputName, path => $firstOutputPath },
    513                 { rows => 1, columns => ['id'], join => ['buildoutputs'] });
    514             if (defined $prevBuild) {
    515                 print STDERR "    already scheduled/built as build ", $prevBuild->id, "\n";
    516                 $buildMap->{$prevBuild->id} = { id => $prevBuild->id, jobName => $jobName, new => 0, drvPath => $drvPath };
    517                 return;
    518             }
    519         }

So it seems that if the output path (and therefore the inputs) hasn't changed, then hydra-evaluator won't schedule a new build. But changing the URL of a fixed-output derivation changes neither its inputs nor its output path, so even though I fixed the underlying problem in building the derivation, a new build won't be scheduled.

Correct me if I'm wrong, but the code above doesn't seem to take into account transient fetchurl failures, and it also seems redundant, considering that Nix already caches build successes and failures.
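One conceivable mitigation (just a sketch, not actual Hydra code; it assumes the Builds table exposes `finished` and `buildstatus` columns, with 0 meaning success) would be to reuse the previous build only when it is still scheduled or finished successfully, so that a failed build gets rescheduled:

    # Sketch only: reuse a previous build from the last evaluation only if
    # it is still scheduled (!finished) or succeeded (buildstatus == 0);
    # otherwise fall through and schedule a fresh build, so a transient
    # failure such as a bad fetchurl URL can be retried.
    my ($prevBuild) = $prevEval->builds->search(
        { project => $jobset->project->name, jobset => $jobset->name, job => $jobName,
          name => $firstOutputName, path => $firstOutputPath },
        { rows => 1, columns => ['id', 'finished', 'buildstatus'], join => ['buildoutputs'] });
    if (defined $prevBuild && (!$prevBuild->finished || $prevBuild->buildstatus == 0)) {
        print STDERR "    already scheduled/built as build ", $prevBuild->id, "\n";
        $buildMap->{$prevBuild->id} = { id => $prevBuild->id, jobName => $jobName, new => 0, drvPath => $drvPath };
        return;
    }

Whether this is the right trade-off is debatable, since it would also rebuild genuinely failing jobs on every evaluation; it is only meant to illustrate where a status check could go.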

A similar problem (although more of a race condition) seems to exist just a few lines below:

    521         # Prevent multiple builds with the same (job, outPath) from
    522         # being added.
    523         my $prev = $$jobOutPathMap{$jobName . "\t" . $firstOutputPath};
    524         if (defined $prev) {
    525             print STDERR "    already scheduled as build ", $prev, "\n";
    526             return;
    527         }

This means that if I put a bad URL in commit N and Hydra evaluates the jobset, then fix the URL in commit N+1 and Hydra evaluates it again while the previous job is still queued, no new job will be queued. Instead, the result for commit N+1 will reflect the failure from commit N, even though the problem has already been fixed in commit N+1.
