Create technical metadata audit mechanism #515

Open
justinlittman opened this issue Mar 20, 2024 · 1 comment

@justinlittman

As revealed by #485 and #510, it is possible for the techmd service to get out of sync with an item (most notably, missing files). To assist with the remediation of these cases, it would be useful to be able to audit the techmd system.

One approach to this might be:

  • Add an audit endpoint to the techmd API that accepts a druid and an array of filepath/md5 pairs (like the existing create endpoint). The techmd service would check that these pairs correspond to its techmd records and return the results.
  • Add an auditing report to DSA that, for each closed DRO, invokes the audit endpoint and records the results.
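
As a rough illustration of the comparison the audit endpoint might perform, here is a minimal sketch. It assumes both the request payload and the techmd records can be reduced to a `{filepath: md5}` mapping; the names `AuditResult` and `audit_files` are hypothetical and not part of the existing techmd API.

```python
from dataclasses import dataclass, field


@dataclass
class AuditResult:
    # Hypothetical result shape; the real endpoint may report differently.
    missing: list = field(default_factory=list)     # expected files with no techmd record
    mismatched: list = field(default_factory=list)  # same filepath, different md5
    unexpected: list = field(default_factory=list)  # techmd records for files not expected


def audit_files(expected, recorded):
    """Compare expected {filepath: md5} pairs against recorded {filepath: md5} pairs."""
    result = AuditResult()
    for path, md5 in expected.items():
        if path not in recorded:
            result.missing.append(path)
        elif recorded[path] != md5:
            result.mismatched.append(path)
    result.missing.sort()
    result.mismatched.sort()
    # Anything techmd knows about that the caller did not send.
    result.unexpected = sorted(set(recorded) - set(expected))
    return result
```

A DSA report could then call something like this once per closed DRO and store the three lists. Comparing strictly by filepath is the simplest option, though it would flag renamed or deduplicated files as missing.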
@andrewjbtw

I'm aware of two tricky cases to watch out for:

1. Filename on disk does not match filename in current version Cocina (filename has changed)

How does this happen?

When a filename is changed, the Moab does not receive a new copy of the file. Instead, the Moab manifests are updated to associate the existing file on disk with the new filename.

Example druid: https://argo.stanford.edu/view/druid:yw479qv6748

Some of the files in this druid were renamed after Moab version 1. Take these 3 files:

yw479qv6748-1.dderr
yw479qv6748-1.img
yw479qv6748-1.img.sha

These were deposited in Moab version 1 with different names before a later update modified the names (but not the content):

/pres-01/sdr2objects/yw/479/qv/6748/yw479qv6748/v0001/data/content/da39a3ee5e6b4b0d3255bfef95601890afd80709.dderr
/pres-01/sdr2objects/yw/479/qv/6748/yw479qv6748/v0001/data/content/da39a3ee5e6b4b0d3255bfef95601890afd80709.img
/pres-01/sdr2objects/yw/479/qv/6748/yw479qv6748/v0001/data/content/da39a3ee5e6b4b0d3255bfef95601890afd80709.img.sha

The technical metadata service generated the techMD for those three files and associated it with the names on disk:

File da39a3ee5e6b4b0d3255bfef95601890afd80709.dderr

    filetype x-fmt/111
    mimetype text/plain
    bytes 26634
    file_modification 2013-06-20T21:14:15.000Z

File da39a3ee5e6b4b0d3255bfef95601890afd80709.img

    filetype fmt/1087
    bytes 368640
    file_modification 2013-06-20T21:14:15.000Z

File da39a3ee5e6b4b0d3255bfef95601890afd80709.img.sha

    filetype x-fmt/111
    mimetype text/plain
    bytes 173
    file_modification 2013-06-20T21:14:15.000Z

This is not wrong but does mean that the techMD is not associated with the current filenames.
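
One way an audit could avoid reporting these as missing is to fall back to matching on md5 when the current filename has no techMD record. The sketch below assumes each techMD record carries the file's md5 and that both sides reduce to `{filename: md5}`; `find_renamed` is a hypothetical helper, and the checksum strings in the usage example are placeholders, not the real values for this druid.

```python
def find_renamed(expected, recorded):
    """Return {current_name: [recorded_names]} for files whose techMD
    exists only under a different filename with the same md5.

    expected: {filename: md5} from the current Cocina (assumed shape)
    recorded: {filename: md5} from techMD (assumed shape)
    """
    # Index the techMD records by checksum so lookups ignore the filename.
    names_by_md5 = {}
    for name, md5 in recorded.items():
        names_by_md5.setdefault(md5, []).append(name)

    renamed = {}
    for name, md5 in expected.items():
        if name in recorded:
            continue  # techMD already exists under the current name
        if md5 in names_by_md5:
            renamed[name] = sorted(names_by_md5[md5])
    return renamed


# Placeholder checksums, for illustration only.
cocina = {"yw479qv6748-1.img": "md5-img", "yw479qv6748-1.dderr": "md5-dderr"}
techmd = {
    "da39a3ee5e6b4b0d3255bfef95601890afd80709.img": "md5-img",
    "da39a3ee5e6b4b0d3255bfef95601890afd80709.dderr": "md5-dderr",
}
renames = find_renamed(cocina, techmd)
```

Here `renames` maps each current filename to the on-disk name that techMD was generated under, so the audit can report "renamed" rather than "missing".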

2. Filename in Cocina is not present on disk (file is a duplicate)

How does this happen?

When duplicate files are accessioned into the same Moab, the Moab stores only one copy of the file on disk. The Moab manifest records the filenames of the other copies and associates them with the single copy stored on disk.

I don't have a current example of this condition because it's hard to find druids with duplicate files. The last time this came up was the issue that motivated #485. We had a set of druids where there was techMD for the specific file stored on disk but not for the filenames (in the Cocina) of the duplicate copies that were not stored on disk.
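
An audit could detect this case by grouping the Cocina filenames by md5 and checking whether every name in a group has techMD. This is only a sketch under the same assumption as before (both sides available as `{filename: md5}`); `group_duplicates` and `uncovered_duplicates` are hypothetical names.

```python
from collections import defaultdict


def group_duplicates(expected):
    """Return {md5: [filenames]} for checksums shared by more than one file."""
    by_md5 = defaultdict(list)
    for name, md5 in expected.items():
        by_md5[md5].append(name)
    return {md5: sorted(names) for md5, names in by_md5.items() if len(names) > 1}


def uncovered_duplicates(expected, recorded):
    """Return duplicate filenames that lack techMD although a sibling
    with the same md5 has a techMD record."""
    recorded_names = set(recorded)
    uncovered = []
    for names in group_duplicates(expected).values():
        if any(name in recorded_names for name in names):
            uncovered.extend(n for n in names if n not in recorded_names)
    return sorted(uncovered)
```

For example, with `expected = {"a.tif": "x", "copy/a.tif": "x"}` and `recorded = {"a.tif": "x"}`, `uncovered_duplicates` returns `["copy/a.tif"]`, which is the situation described above.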
