Metadata for many fragments of a recording #271

Open
PeterisP opened this issue Feb 10, 2023 · 4 comments
@PeterisP

The Core spec supports a scenario where there are many capture segments in a recording data file and the full data file is available.

We would like to apply SigMF to a scenario where the dataset is provided as separate IQ files, one for each capture segment, containing extracted transmission packets (separated both in time and in frequency, i.e. channels). The full original recording is not provided: the non-packet time and frequency ranges are discarded to keep the dataset size manageable.

We would like to request some way to extend the capture segment object definition to include a reference to a specific dataset file instead of an index into the global sample stream. Perhaps this overlaps with #245.
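To make the request concrete, here is a minimal sketch of what such a capture object could look like. Note that `core:dataset_file` is an invented, purely hypothetical field name used only for illustration; the other keys are existing capture-scope fields.

```python
# Hypothetical sketch only: "core:dataset_file" is NOT a real SigMF field.
# It illustrates the requested extension: a capture segment that points at
# its own extracted IQ file instead of indexing into one global sample stream.
capture = {
    "core:sample_start": 0,                          # index within the per-packet file
    "core:frequency": 2.412e9,                       # channel of this extracted packet
    "core:datetime": "2023-02-10T12:00:00.000Z",
    "core:dataset_file": "packet_00042.sigmf-data",  # hypothetical per-segment reference
}
```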

Alternatively, this might be solvable by treating it as a SigMF Collection (i.e., when extracting packets, convert a single SigMF Recording into many SigMF Recordings, one per packet). In that case, however, we would need a way to represent a Collection of Collections, since there would be many linked recordings, each with many extracted packets.

@jacobagilbert (Member) commented Feb 10, 2023

Hi @PeterisP - I think I understand generally what you are looking for here, but I could use a little more info to best answer this. My initial thought is to steer you toward the Collection idea, but can you explain a bit more about the need for a "Collection of Collections"? After chopping up multiple files and extracting the bursts, I would consider the initial concept of the base (pre-chopped) files somewhat irrelevant at that point, and you could flatten those into just one collection. Can you describe why that would be problematic?

Right now (as you identified) a SigMF Recording is the core component of SigMF data and consists of exactly one data file and one metadata file, which was a very intentional decision. Changing that would be a fairly large change in how SigMF is defined, so there would need to be a compelling reason, which I think is still lacking for #245.

@PeterisP (Author) commented Feb 11, 2023

The initial concept of the base (pre-chopped) files is relevant to me because the expected analysis treats the packets as parts of a single conversation: linking a request fragment with its response, tracking the time offsets between packets, tracking the channel-hopping pattern, and decoding payloads from many packets into a continuous stream. There is therefore a need for a metadata structure that links those segments together.

The other issue is the practicality of file sizes. The use case I have in mind is a dataset of many radio recordings of the same actions performed with many different devices, as data for device behavior analysis and fingerprinting. Keeping every packet as a separate file is not optimal: each recording may have 10,000-100,000 fragments (packets), and some filesystems have serious performance issues with datasets consisting of millions of separate files. It therefore seems desirable to store each recording as a SigMF Collection in a SigMF Archive.

On the other hand, flattening all recordings into a single Collection would mean a single Archive, which is impractical: the whole package can be very large (many terabytes), which is awkward to distribute and may have performance issues when seeking specific packet files within the .tar. And keeping each recording in its own Collection/Archive implies the need for some index to summarize them (a file listing every recording, but not every single data packet in that recording) - i.e., a "collection of collections".

@jacobagilbert (Member)

Ok, thanks for expanding. Having tens of thousands of files open (and dealing with the associated thrashing) is certainly something to avoid, so that makes a lot of sense.

One other thing that comes to mind is to reduce each raw data file to just the segments of interest, concatenated together. You can use the capture-scope core:sample_start and core:global_index fields (and optionally core:datetime) to identify, for each segment in that single chopped-and-concatenated data file, its temporal boundaries within the original recording. This would permit metadata specification very much like what you are requesting, massively reduce the file count, and add essentially no complexity in terms of data access (it is a trivial seek operation per segment). These SigMF Recordings could then easily be associated using an existing sigmf-collection metafile.
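As a rough sketch (the sample indices, frequencies, and timestamps below are made up for illustration; the field names are real capture-scope core fields), the captures array for one such concatenated file might look like this:

```python
# Two extracted packets concatenated back-to-back in one data file.
# core:sample_start -> offset of the segment within THIS concatenated file
# core:global_index -> where the segment's first sample sat in the original stream
captures = [
    {
        "core:sample_start": 0,
        "core:global_index": 1_500_000,
        "core:frequency": 2.412e9,
        "core:datetime": "2023-02-10T12:00:00.000Z",
    },
    {
        "core:sample_start": 4096,        # the first packet was 4096 samples long
        "core:global_index": 9_200_000,   # the discarded gap shows up as a jump here
        "core:frequency": 2.437e9,        # each segment may carry its own frequency
        "core:datetime": "2023-02-10T12:00:03.850Z",
    },
]
```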

This is an approach used by a lot of people, and it also happens to work nicely for decimated and de-hopped FHSS data because each segment can have its own frequency; really the main limitation is that all segments must be at the same sample rate (a firm requirement for any single SigMF Recording).
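For instance, assuming complex 64-bit samples (the cf32_le datatype) and the captures list sketched above, extracting segment k really is a single seek-and-read. The helper below is my own illustration, not part of any SigMF tooling:

```python
import numpy as np

def read_segment(data_path, captures, k, dtype=np.complex64):
    """Read capture segment k from a chopped-and-concatenated data file."""
    start = captures[k]["core:sample_start"]
    # A segment ends where the next one starts, or at end-of-file.
    if k + 1 < len(captures):
        count = captures[k + 1]["core:sample_start"] - start
    else:
        count = -1  # np.fromfile reads to EOF when count is -1
    return np.fromfile(data_path, dtype=dtype, count=count,
                       offset=start * np.dtype(dtype).itemsize)
```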

@jacobagilbert (Member) commented Mar 1, 2023

@PeterisP I've continued to think about this, and allowing "collections of collections" seems to make a lot of sense to me. I have opened #272 for discussion of this aspect of your issue in detail.

Were you able to look into using concatenation and captures segments to accomplish what you need?
