
Schema vs Model Distinction #70

Open
bhilburn opened this issue Sep 20, 2017 · 17 comments
@bhilburn
Contributor

One of the comments we got at GRCon about SigMF is that it seemed to make working with datasets difficult.

Specifically, what this person wanted to be able to do was SELECT something in a database, parametrically, based on the metadata, and then have it return a chunk of samples. The obvious solution is chunking the SigMF data file by capture segment and then storing those chunks with the segments as keys - but this no longer represents a compliant recording per the standard. Possible? Yes. But not standard.

Is this something we should address? I agree that it is a useful structure and I think a lot of users will want to use something like it. Even if we don't want to make this a compliance requirement, are there things we can do in the standard to make it easier to accomplish?

@kpreid
Contributor

kpreid commented Sep 20, 2017

I claim that as a general principle of software engineering, one should not call an application noncompliant just because:

  • it internally, as an implementation detail, stores data in a format different from the standard, or
  • it is capable of returning the data in a nonstandard but useful format.

Rather, compliance of an application should be defined by conditions such as:

  • it can read/import/intake files in the standard format (if the application reads such files);
  • it can write/export files in the standard format (if the application writes such files);
  • it does not produce nonstandard-format files it claims are standard; and
  • there are no compliant files which it cannot read, other than due to size limits.

@djanderson
Contributor

@bhilburn, did you get any more insight into what specifically makes it difficult with databases? I think the fact that we split metadata from data, break data into capture segments, and provide unique keys in the form of sample_start to find those capture segments makes it pretty straightforward to load into a database. For the record, I'm storing SigMF data in a relational db, though I don't give each capture its own row. While I do store data in a db for more efficient searching/filtering/seeking, I wouldn't want the actual sigmf format to be anything other than a flat file.

I'm honestly not sure what we could do to make SigMF easier to drop into a database, and as @kpreid said, there's nothing about the spec that stops or even discourages anyone from creating an application that does so.
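For illustration, here is a rough sketch of the kind of indexing described above, using sqlite3. The table layout, helper name, and fixed sample size are invented for this example, not anything the spec defines; only the core:sample_start field comes from the SigMF core namespace.

```python
import json
import sqlite3

BYTES_PER_SAMPLE = 8  # assumption: e.g. "cf32_le" (32-bit complex float)

def index_recording(meta_path, db):
    """Index each capture segment by core:sample_start so a SELECT can
    return the byte offset of the matching chunk in the .sigmf-data file."""
    with open(meta_path) as f:
        meta = json.load(f)
    db.execute(
        "CREATE TABLE IF NOT EXISTS captures "
        "(sample_start INTEGER PRIMARY KEY, byte_offset INTEGER)"
    )
    for cap in meta["captures"]:
        start = cap["core:sample_start"]
        db.execute(
            "INSERT OR REPLACE INTO captures VALUES (?, ?)",
            (start, start * BYTES_PER_SAMPLE),
        )
    db.commit()
```

With an index like this, a parametric SELECT over the metadata returns a byte offset into the flat data file rather than requiring the samples themselves to live in the database.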

@bhilburn
Contributor Author

bhilburn commented Sep 20, 2017

The biggest proponent for this, actually, was @namccart. He was explaining that one of the reasons that he really likes VITA49 for this particular application is that it provides pre/easily 'chunkable' data.

So, based on my understanding from @namccart, for example, if you load a SigMF recording into a database and search over sample_start as a key, once you identified one you wanted you would then still have to load the entire dataset to index to the key. As you said, @djanderson, "[you] don't give each capture its own row", which I think is Nick's issue?

Nick, can you comment?
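To make the point of contention concrete, here is a minimal sketch of pulling one capture segment out of a flat .sigmf-data file with a seek, without reading the rest of the dataset. The helper name and the fixed sample size are hypothetical, not part of the spec.

```python
def read_capture(data_path, sample_start, num_samples, bytes_per_sample=8):
    """Seek directly to a capture segment in a flat .sigmf-data file,
    given its core:sample_start, and return just that chunk of bytes."""
    with open(data_path, "rb") as f:
        f.seek(sample_start * bytes_per_sample)
        return f.read(num_samples * bytes_per_sample)
```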

@mbr0wn
Contributor

mbr0wn commented Oct 3, 2017

I'm also not convinced this is really a SigMF problem. I see how it makes writing SQL <-> SigMF converters a bit more complicated, but they also solve really, really different problems.

@namccart

namccart commented Oct 4, 2017 via email

@bhilburn
Contributor Author

bhilburn commented Oct 9, 2017

Okay, so, SigMF already provides a solution to this, but we should discuss whether there are changes that would improve it:

So, what @namccart cares about, per my comment above, is the ability to load smaller "chunks" of data than the entire dataset, which makes it much easier to work with databases. SigMF allows for this using the offset field of the core namespace, which allows you to break datasets up into multiple files that represent a continuous recording. You could, for example, break a dataset into five .sigmf-data files that have five matching .sigmf-meta MD files, with offsets that connect each one to the one that precedes it.

So, the question here, then, is "What, if anything, could we do to make this better?" Is there some change we should recommend? If we just provide a tool that cleanly splits your dataset into multiple files of a parameterizable size, does that solve the issue?
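The splitting tool floated here could be quite small. The sketch below is one possible shape: the file naming is invented, and each emitted .sigmf-meta stub carries only core:offset for brevity, whereas a real tool would copy the rest of the original metadata as well.

```python
import json
import os

def chunk_dataset(data_path, chunk_samples, bytes_per_sample=8):
    """Split a flat .sigmf-data file into fixed-size pieces, emitting a
    minimal .sigmf-meta stub per piece that uses core:offset to tie the
    pieces back into one continuous recording."""
    chunk_bytes = chunk_samples * bytes_per_sample
    base, _ = os.path.splitext(data_path)
    outputs = []
    with open(data_path, "rb") as f:
        i = 0
        while True:
            block = f.read(chunk_bytes)
            if not block:
                break
            data_out = f"{base}-{i}.sigmf-data"
            with open(data_out, "wb") as out:
                out.write(block)
            # Stub metadata: a real chunker would also carry over the
            # global, captures, and annotations sections.
            meta = {"global": {"core:offset": i * chunk_samples}}
            with open(f"{base}-{i}.sigmf-meta", "w") as out:
                json.dump(meta, out)
            outputs.append(data_out)
            i += 1
    return outputs
```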

@bhilburn
Contributor Author

Spun on this a bit. @kpreid had a really good point, early on, that we shouldn't call something non-compliant because of anything it does "internally" or "locally". It's really all about the ingress and egress.

Per my previous comment, what @namccart wants to do is already pretty doable with SigMF. We could make it easier by providing a tool, for example, that showed you how to chunk the data based on metadata segment, but there really isn't anything difficult, here, in my opinion.

So, I think the final question that should be debated is whether or not this is a format that we want to be able to distribute SigMF Recordings in. Right now, a compliant recording cannot be distributed where the binary data has been chunked into a bunch of files and one metadata file references all of them. We specifically decided against allowing the 1-to-many case in #19. It was in the context of multiple streams of data, but the reasoning still applies here, I think.

So, before we close this issue out with either a do nothing or a make an example chunker program decision, does anyone think we should revisit 1-to-many given this usecase? We do now have an archive format described, which we didn't at the time of #19, so it would (presumably) be easier to distribute multiple files in a recording.

@dharasty

dharasty commented Nov 13, 2017

I'm new here, so if these comments are missing the point, I apologize.

One feeling I had as I read the spec (as an experienced spec reader and writer) is that the current draft spec conflates semantic content of the metadata with the transfer encoding/format of the data.

In plain English: it seems to me the definition of "what are the allowed tags and values in SigMF metadata" can (and should) be separate from HOW the tag value pairs are encoded.

I'm all for SigMF metadata including "datatype" and "sample_rate" and "version", and so forth; consider this the "schema" of "SigMF metadata". But I think the spec would be strengthened by separating out the fact that "it must be a JSON file".

I feel the SigMF spec SHOULD say: when SigMF metadata is written to a file, it must then be a JSON-structured UTF-8 file, with a single object per file, using the following extension.

If a standard way of "writing a SigMF object to a SQL database" is ALSO needed, then that should be specified as an alternate way to store SigMF metadata (and maybe the dataset, too).

Should one write the JSON version of the metadata as a text blob to a single VARCHAR field? Or should each field of the metadata get its own SQL field? Personally, I don't care; I find both of these reasonable in certain cases. Should the SigMF spec weigh in on the "correct"/standard way to do this? Only if the community thinks it is helpful.
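Both options are trivially implementable, which is partly why the spec may not need to pick one. A sketch with sqlite3, where the table layouts are invented purely for illustration:

```python
import json
import sqlite3

def store_blob(db, meta):
    """Option A: store the whole metadata object as a JSON text blob."""
    db.execute("CREATE TABLE IF NOT EXISTS recordings_blob (meta TEXT)")
    db.execute("INSERT INTO recordings_blob VALUES (?)", (json.dumps(meta),))

def store_fields(db, meta):
    """Option B: promote selected core fields to their own columns."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS recordings "
        "(datatype TEXT, sample_rate REAL, version TEXT)"
    )
    g = meta["global"]
    db.execute(
        "INSERT INTO recordings VALUES (?, ?, ?)",
        (g["core:datatype"], g.get("core:sample_rate"), g["core:version"]),
    )
```

Option A keeps the metadata round-trippable byte-for-byte; option B makes it queryable. Which trade-off matters depends on the application, which argues for leaving it out of the core spec.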

And then what if I want to store SigMF data -- both metadata and the dataset -- in a document database such as MongoDB? Do we need to define a "standard" -- that is "compliant" -- way to do that?

MY MAIN POINT is that because the verbiage of the spec conflates the schema with "SigMF metadata is a JSON object with this format", I think it leads to the ambiguity being discussed in this thread.

My advice: separate sections for the semantic part of "what is SigMF metadata", and then requirements for how they should be serialized into a file (JSON), and -- if desired by the community -- recommendations for "best practices" when stored in relational records, or a document database, or -- as needed -- in other portable files/containers/transfer mechanisms.

@kpreid
Contributor

kpreid commented Nov 14, 2017

I agree that distinguishing consistently between schema/model and encoding would be useful, but I think that "separate sections" is a bad idea unless those sections are interleaved: the value of making the distinction clear is less than the value of making it obvious how to implement SigMF's intended primary use — an interchange file format.

@dharasty

dharasty commented Nov 14, 2017

@kpreid: It is a pretty common technique in standards documents to separate the schema from encodings. In fact, in many standards documents, ALL the encodings show up as examples/supplementary information in appendices.

For SigMF to really catch on, I think it needs to address its motivating use case of FILE interchange, but it ought to give SOME consideration to logical next steps, such as storing both the dataset and metadata in either relational or document databases. (After all, a filesystem or a tarfile is simply ONE instance of a "document database" or "document datastore".)

Actual file storage might be many users' primary use case... but for me, it probably won't be. Minor adjustments to the contents and the format of the spec might ensure my use case is well covered, too. This will be a boon to the spec if we can achieve it without impeding the file use case... and I feel we can.

All that said, I have no trouble with inlining/interleaving JSON-file examples in the text, provided 1) there is a clear editorial distinction between "schema requirements" and "JSON-file encoding requirements", and 2) there is some other place in the document that addresses the needs of other encodings (possibly appendices).

@bhilburn
Contributor Author

So, it's taken me far too long to address this.

@dharasty - I think you make really excellent points, and I appreciate you providing your insight, here. I would like to make the change you suggest (i.e., distinguishing between schema and model) as part of the v0.0.2 stuff I'm hacking on, now.

I'm interested to know your thoughts on the best way to go about doing this. Is there any chance you would be up for putting together a PR that demonstrates an approach you think works well?

@bhilburn bhilburn changed the title Compatibility with Databases Schema vs Model Distinction Jul 17, 2018
@bhilburn
Contributor Author

Some minor changes that clearly distinguish between the schema and file encoding will be made in the v0.0.2 release per the discussion above.

@bhilburn bhilburn added this to the Release v0.0.2 milestone Jul 12, 2019
@jacobagilbert
Member

jacobagilbert commented May 27, 2021

I feel like this is an important conversation, but it should probably be pushed to v1.1+ so as not to delay the timely release of v1.0.0.

@gmabey
Contributor

gmabey commented Jun 14, 2021

@bhilburn do you agree with @jacobagilbert 's comment? I do.

@gmabey
Contributor

gmabey commented Jul 9, 2021

@bhilburn ping

@bhilburn
Contributor Author

I actually think the fundamental change we are talking about here is super simple and pretty light-touch. I'll get a PR together that does it once we've got the major churn done (merging #135 and #140).

@gmabey
Contributor

gmabey commented Aug 4, 2021

@bhilburn It is pretty exciting to see that this is the only issue still languishing in the "Not Started" bucket for the 1.0 release ... I wait with bated breath for progress :-D
