Timeout while rescanning a previously scanned manifest leaves IndexReport empty forever #1124

iainduncani · 2023-10-26T06:10:17Z

The Problem

We are finding that sometimes when we scan a manifest that has already been scanned we will end up with an empty IndexReport stored for that manifest:

{
  "manifest_hash": "sha256:a2f196cc1f224ba665cde842a5be1d3c7ded333c279525242eba123677152172",
  "state": "",
  "packages": {},
  "distributions": {},
  "repository": {},
  "environments": {},
  "success": false,
  "err": ""
}

Once in this state any further request to rescan it will not do anything to change the IndexReport. The logs from the call to scan the previously scanned manifest show:

info index request start
debug locking attempt
debug locking OK
info starting scan
info index request done

The logs on any future attempts show an almost identical output except they include the manifest already scanned line:

info index request start
debug locking attempt
debug locking OK
info starting scan
info manifest already scanned
info index request done

Suspected Code Path

Looking at the code I think this happens when an attempt to get the IndexReport in checkManifest timeouts. The flow I believe is:

The Controller enters the run method and calls CheckManifest as the first state. At this point the IndexReport on the controller is empty
checkManifest discovers that the manifest is already scanned
checkManifest attempts to load the IndexReport but the call to s.Store.IndexReport times out (5s timeout) and an err is returned. checkManifest does not log anything in this case and IndexReport on the controller is not set because it was not loaded.
Back in the calling controller it goes through the checks as to whether an error occurred. As this is a timeout errors.Is(err, context.DeadlineExceeded) will be true and it will go into that case
As the timeout case does not exit the for loop early it just sets the retry variable to true the Controller then proceeds to store the IndexReport into the database. This stores an empty IndexReport into the database.
The Controller then continues by going into the retry logic which pauses for a bit and then sets err to nil.
When checkManifest returned it returned the next state as Terminal so Controller now sees this and exits

I think this is consistent with what we see in the logs where we get no logging at all indicating an error has occurred. Nowhere in the code path above does a log get emitted with the error from checkManifest. Future calls to scan on this manifest follow a very similar path except the call to IndexReport does not timeout but all it loads is an empty IndexReport, it then emits the manifest already scanned log line and does not return an error but we are left with the empty IndexReport which check manifests thinks is OK.

The solution

I think there are a couple of things to be addressed here. I'm happy to work on a PR but would like advice first as I think some change to behaviour is required.

Retry logic in controller

I do not think that the retry is doing anything. Every state that I can see that returns an error also sets the next state to Terminal. This means if any of these errors are timeouts then they will go through the retry logic path of having a small pause before seeing that the next state is Terminal and exiting.

Personally I would delete the errors.Is(err, context.DeadlineExceeded) case and retry variables rather than attempting to fix them (the fix would be to change that case to have what is currently inside the if retryblock and put acontinueon the end so it goes back around the loop without changing thenextstate and without persisting a potentially invalidIndexReport`).

Checking the validity of manifests in `checkManifest`

Fixing the retry logic maybe enough to fix this but care would need to be taken to not persist a bad IndexReport when the call to s.Store.IndexReport fails. Even if the Controller's standard error handling code was used for one of these errors then it would result in an IndexReport being persisted with state IndexError which again isn't what is wanted. I wonder if a check should be added to checkManifest to actually make sure the loaded IndexReport is valid, something like:

if sr.State != IndexFinished {
    zlog.Info(ctx).Msg("IndexReport not valid so manifest to be scanned")
    // Potentially do the filtering that happens in the case where a Manifest has not been scanned, but in this case there will be no scanners left as we know if we get this far `ManifestScanned` returned `ok`
    s.Vscnrs = indexer.VersionedScanners{}
    return FetchLayers, nil
}

The text was updated successfully, but these errors were encountered:

iainduncani · 2023-10-31T09:12:03Z

I put a proposed solution into PR #1131 with questions about the right approach to take.

iainduncani mentioned this issue Oct 31, 2023

Check manifest to check index report #1131

Draft

hdonnay added the triaged The maintainers have seen this issue label Nov 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Timeout while rescanning a previously scanned manifest leaves IndexReport empty forever #1124

Timeout while rescanning a previously scanned manifest leaves IndexReport empty forever #1124

iainduncani commented Oct 26, 2023

iainduncani commented Oct 31, 2023

Timeout while rescanning a previously scanned manifest leaves IndexReport empty forever #1124

Timeout while rescanning a previously scanned manifest leaves IndexReport empty forever #1124

Comments

iainduncani commented Oct 26, 2023

The Problem

Suspected Code Path

The solution

Retry logic in controller

Checking the validity of manifests in checkManifest

iainduncani commented Oct 31, 2023

Checking the validity of manifests in `checkManifest`