Semantics of reprocessing data #33

dpwrussell · 2018-08-24T15:30:21Z

There are several use-cases that warrant reprocessing of data:

Failure during the scan stage to identify a fileset that might be fixed in a new version of the scanner.
Failure during the extract stage to successfully extract a fileset that might be fixed in a new version of the extractor.
Failure during the scan/extract stage due to an unexpected server-side error that has since been resolved.
Even if the extract stage completes successfully, the extracted metadata or images might be suboptimal, and the fileset could benefit from reprocessing.

The exact semantics of reprocessing need to be defined before coming up with an implementation strategy.

Questions:

Is the original import entirely replaced by the reprocessed one?
Is the original fileset entirely replaced by the reprocessed one?
If reprocessed imports/filesets do not replace the originals, what happens to the originals and how do we record this in the database?
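
To make the last question concrete, here is a minimal sketch of one possible answer: keep every processing run and mark older runs as superseded rather than deleting them. This is not a decided design. The table name `fileset_run`, its columns, and the helpers `record_run` and `current_run` are all hypothetical, and SQLite is used only so the example is self-contained.

```python
import sqlite3

# Hypothetical schema: one row per processing run of a fileset.
SCHEMA = """
CREATE TABLE fileset_run (
    id INTEGER PRIMARY KEY,
    fileset_key TEXT NOT NULL,        -- assumed stable identifier for the fileset
    extractor_version TEXT NOT NULL,  -- version of the extractor used for this run
    status TEXT NOT NULL,             -- e.g. 'ok' or 'failed'
    superseded_by INTEGER REFERENCES fileset_run(id)  -- NULL means this run is current
);
"""


def record_run(conn, fileset_key, extractor_version, status):
    """Insert a new run and mark the previous current run (if any) as superseded."""
    with conn:  # single transaction: commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO fileset_run (fileset_key, extractor_version, status) "
            "VALUES (?, ?, ?)",
            (fileset_key, extractor_version, status),
        )
        new_id = cur.lastrowid
        # Older runs are kept for history; they just point at their replacement.
        conn.execute(
            "UPDATE fileset_run SET superseded_by = ? "
            "WHERE fileset_key = ? AND superseded_by IS NULL AND id != ?",
            (new_id, fileset_key, new_id),
        )
    return new_id


def current_run(conn, fileset_key):
    """Return (id, extractor_version, status) of the current run, or None."""
    return conn.execute(
        "SELECT id, extractor_version, status FROM fileset_run "
        "WHERE fileset_key = ? AND superseded_by IS NULL",
        (fileset_key,),
    ).fetchone()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    record_run(conn, "fs1", "1.0", "failed")  # original attempt
    record_run(conn, "fs1", "1.1", "ok")      # reprocessed with a newer extractor
    print(current_run(conn, "fs1"))
```

With this approach the original is never overwritten, so "replace" becomes a question of which run is treated as current. The alternative (deleting or overwriting the original row) would be simpler but loses the history of why a fileset was reprocessed.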
