IQSS/10108 Stata mimetype refinement for direct upload #11054
What this PR does / why we need it: As noted in the issue and recently in https://groups.google.com/g/dataverse-community/c/TBx0TTins2k, our (somewhat) extensible mimetype detection mechanism assumes it starts from a local temp file, which doesn't exist for direct uploads. The most visible problem is that all Stata *.dta files initially get the generic application/x-stata type, and ingest sends all of those to the Stata 13 ingestor, which then fails for Stata 14/15 files. With normal uploads, the local detector reads the first few bytes of the file and assigns a more specific type, e.g. application/x-stata-14 (or 15, etc.), which gets routed to the correct version of the Stata ingest code.
Since determining the Stata version only requires the first ~42 bytes of the file, and those can be read into a buffer rather than from a File, this PR adds code to retrieve the required bytes and run the Stata version check on direct upload (S3) and presumably on remote/Globus files on S3 (cases where the storageidentifier is provided during upload and where getInputStream works).
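For reviewers who want a picture of the idea, here is a minimal sketch of a buffer-based version check. It is not the code in this PR: the class and method names are hypothetical, and the release-to-mimetype mapping (117/118/119 to the 13/14/15 types) and the exact header layout are assumptions based on the newer XML-like .dta header.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class StataHeaderSniffer {

    // Generic type assigned before refinement.
    static final String GENERIC_STATA = "application/x-stata";

    // Number of leading bytes to read; "~42" per the PR description.
    static final int HEADER_BYTES = 42;

    // Assumed header layout of newer .dta files.
    private static final Pattern RELEASE = Pattern.compile(
            "^<stata_dta><header><release>(\\d{3})</release>");

    // Reads at most HEADER_BYTES from the stream and returns a refined
    // mimetype, or the generic type if the header is not recognized.
    public static String refineStataType(InputStream in) throws IOException {
        byte[] head = in.readNBytes(HEADER_BYTES); // Java 11+
        String text = new String(head, StandardCharsets.US_ASCII);
        Matcher m = RELEASE.matcher(text);
        if (!m.find()) {
            return GENERIC_STATA;
        }
        // Assumed mapping from .dta release number to versioned mimetype.
        switch (m.group(1)) {
            case "117": return "application/x-stata-13";
            case "118": return "application/x-stata-14";
            case "119": return "application/x-stata-15";
            default:    return GENERIC_STATA;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "<stata_dta><header><release>118</release><byteorder>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(refineStataType(new ByteArrayInputStream(sample)));
    }
}
```

Because the check only consumes a short prefix, the same method would work on any stream the storage driver can open, without first copying the file to local disk.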
The PR also has some notes about the potential to clean things up further and implement other mimetype detectors that only require a subset of bytes in a more extensible framework. I don't have any plan to implement this, but incremental steps to add other detectors to the direct upload path are probably possible if there are other cases where we're seeing problems.
Which issue(s) this PR closes:
Special notes for your reviewer: I also added a note that I think there's a no-op section now where we check files by extension when the extension is null. If someone knows more about the history there, perhaps we can restore whatever was supposed to happen there, or we could delete it.
The code also makes a slight logging change. Since the code was updated to allow looking up mimetypes by full filename, we've been getting info-level messages suggesting that the specific filename be added to MimeTypeDetectionByFileName.properties, which is pretty noisy. I kept the message at info level when the extension is checked and isn't in MimeTypeDetectionByFileExtension.properties, but dropped it to fine for the filename check.
Suggestions on how to test this: Find a Stata 14 or 15 file and upload it via direct upload/using a direct upload store, then verify that it gets a mimetype including the version and is ingested (assuming ingest size limits, etc. allow). (Some Stata files have a *.dta extension - searching for that might help in finding an example.)
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?: included
Additional documentation: slightly updated the docs to indicate the Stata check is now done in direct upload.