Compression for SigMF Archives #68

bhilburn · 2017-09-20T15:10:53Z

Item from GRCon discussions:

So, we had previously decided in the discussion about archive formats (#15) to disallow compression, with the reasoning that compressing IQ recordings rarely gives you anything and complicates the reader / writer applications. But a couple of folks from the National Labs pointed out that in some systems you sometimes have recordings of mostly zero values. They are currently using HDF5 and considering moving over to SigMF in a number of their programs, but would like to see a compression capability, which HDF5 allows for.

I think this argument is reasonable, and it does make sense. There will definitely be recordings where the Vpp is not changing so much as to make compression useless.
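To make the point concrete, here is a rough sketch comparing how well a mostly-zero recording compresses against full-scale noise. Everything in it is an assumption for illustration: the synthetic interleaved int16 data, the 5% activity figure, and the choice of zlib at level 6. It is not a measurement of any real recording.

```python
import random
import struct
import zlib


def make_iq_int16(num_samples, active_fraction, seed=0):
    """Build interleaved int16 I/Q bytes where only a fraction of samples are nonzero."""
    rng = random.Random(seed)
    values = []
    for _ in range(num_samples):
        if rng.random() < active_fraction:
            # Full-scale random I and Q stand in for a noise-like signal.
            values.extend((rng.randint(-32768, 32767), rng.randint(-32768, 32767)))
        else:
            values.extend((0, 0))
    return struct.pack("<%dh" % len(values), *values)


def compression_ratio(raw):
    """Return compressed size divided by raw size (smaller is better)."""
    return len(zlib.compress(raw, 6)) / len(raw)


# Hypothetical recordings: 5% of samples active versus every sample active.
sparse = make_iq_int16(50_000, active_fraction=0.05)
dense = make_iq_int16(50_000, active_fraction=1.0)
print("mostly zeros:     %.2f" % compression_ratio(sparse))
print("full-scale noise: %.2f" % compression_ratio(dense))
```

The noise-like case should come out close to 1.0 (no gain), while the mostly-zero case shrinks substantially, which matches the intuition from both sides of the earlier discussion.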

I'd like to solicit thoughts and opinions on both this topic, generally, as well as what forms of compression might make the most sense for SigMF archives.

djanderson · 2017-09-20T16:11:46Z

Not having a compression format in the spec doesn't stop people from using compression, though?

Using tar (which we've used as our archive format) as an example, tar doesn't itself specify any compression, but it's almost always compressed, with the compression format tacked on after .tar, so .tar.gz or .tar.bz2, etc.

Could we recommend a similar syntax, where we use .sigmf.gz, .sigmf.bz2, .sigmf.whatever_works_for_you?

Pros:

simplicity of the spec
simplicity of reader/writer applications (they only have to support untarring/tarring, leaving the user to use an external {de}compression utility first, or they could detect the common compression formats and do that for you as well).
flexibility to use best compression format for your application

Cons:

someone could theoretically compress all their sigmf data with some tool/format that is not free, not OSS, not available on all OS's, etc
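As a sketch of the "detect the common compression formats and do that for you" option, Python's standard `tarfile` module can already open a tar archive whether or not it has been gzip, bzip2, or xz compressed. The function name `read_sigmf_metadata` is made up for this example, and it assumes metadata members use the `.sigmf-meta` suffix from the SigMF naming convention.

```python
import tarfile


def read_sigmf_metadata(archive_path):
    """Return the text of each .sigmf-meta member in a (possibly compressed) archive."""
    metadata = {}
    # Mode "r:*" asks tarfile to detect gzip, bzip2, or xz from the file
    # contents, so the file extension does not matter here.
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".sigmf-meta"):
                text = archive.extractfile(member).read().decode("utf-8")
                metadata[member.name] = text
    return metadata
```

Calling `read_sigmf_metadata("capture.sigmf.gz")` and `read_sigmf_metadata("capture.sigmf")` would go through the same code path, so in this language at least, the reader-side cost of tolerating the common formats is small. A compressor that `tarfile` does not know about would still need an external tool first.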

kpreid · 2017-09-20T16:22:54Z

This is a side issue to the question of doing it at all but note that extensions like .sigmf.gz are bad for use in (at least some) desktop environments because .gz would end up matched to a generic decompressor rather than a SigMF viewer which can decompress transparently. This is why when I proposed standardizing an archive format I specified that the extension should be .sigmf and not .tar.

djanderson · 2017-09-20T16:32:52Z

I'm actually using the SigMF archive format, and I can tell you that 100% of the time I just change the extension to .tar when I download the file so that the built-in archive utility recognizes it and opens it. That's fine, and I get and agree with the point of using the .sigmf extension, but also hiding the compression format would just make the file impossible to open with external tools on a system without a dedicated sigmf reader installed.

You can always open the files directly in the sigmf reader that supports compression to avoid it being picked up by a decompression utility first.

mbr0wn · 2017-10-03T19:33:36Z

We can also, very easily, add tools for common OS's to identify .sigmf as tar, and .sigmf.gz as tar.gz etc. I like @djanderson's approach -- it's common practice, and solves the problem by making it not-our-problem. The issue that people could start using non-free compressors is a thing, but when does that actually happen these days? WinRAR anyone? I haven't used ARJ since the era of distributing video games on 3.5" floppies 👴

kpreid · 2017-10-04T03:00:31Z

> We can also, very easily, add tools for common OS's to identify .sigmf as tar, and .sigmf.gz as tar.gz etc.

@mbr0wn I don't think there is usually the ability to assign a meaning to an extension with another dot in the middle, as .sigmf.gz has. I could be wrong, but I've never seen it actually done. An extension like .sigmfgz would not have this problem (if it is a problem) and would still be documenting the format.

bhilburn · 2017-10-26T20:19:03Z

It doesn't sound like anyone has any arguments against allowing for compression, so great! We'll move ahead with that. Now, to figure out how to spec it, hah =)

@kpreid has an interesting point regarding enabling transparent reading (and thus decompression) if we have a unique extension. Using OS built-ins can be really convenient, too, as @djanderson pointed out.

It seems like a key question is: do we expect most people to be interacting with SigMF Archives using archive utilities or SigMF applications? Let's say we have a SigMF reader application that is popular and widespread - how many people will still want to just download the archive, peek into it, and open the metadata in vim / less / whatever?

To be honest, I don't think I have a strong opinion one way or the other right now. I'll spin on it a bit, and am really interested to hear everyone else's opinion, if you feel strongly about it.

cityscapesc · 2017-11-01T01:47:50Z

Just want to confirm that some of my real-world raw I-Q data files do compress very well (50% or more). int16 interleaved; could be more effective if the data is in ieee754. This feature will help transferring data files over the Internet a lot.

One alternative to compressing the whole archive file could be compressing the raw I-Q data file only, and offer a SigMF label for this. (compression_type: raw, gz, etc.)
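A rough sketch of what that could look like, assuming gzip for the data file and a made-up metadata key. `example:compression_type` is hypothetical and not part of SigMF, as are both function names and the `.sigmf-data.gz` file name; `core:datatype` and `ci16_le` are existing SigMF names.

```python
import gzip
import json


def write_compressed_recording(basename, samples, datatype="ci16_le"):
    """Write gzip-compressed sample bytes next to uncompressed JSON metadata."""
    with gzip.open(basename + ".sigmf-data.gz", "wb") as data_file:
        data_file.write(samples)
    metadata = {
        "global": {
            "core:datatype": datatype,
            # Hypothetical key: SigMF does not define a compression field.
            "example:compression_type": "gz",
        }
    }
    with open(basename + ".sigmf-meta", "w") as meta_file:
        json.dump(metadata, meta_file, indent=2)


def read_samples(basename):
    """Read sample bytes, decompressing only if the metadata label says so."""
    with open(basename + ".sigmf-meta") as meta_file:
        metadata = json.load(meta_file)
    compression = metadata["global"].get("example:compression_type", "raw")
    if compression == "gz":
        with gzip.open(basename + ".sigmf-data.gz", "rb") as data_file:
            return data_file.read()
    with open(basename + ".sigmf-data", "rb") as data_file:
        return data_file.read()
```

The metadata stays plain JSON, so it can still be opened in any text editor, and only the bulky data file pays the decompression cost.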

bhilburn · 2017-11-02T20:35:14Z

@cityscapesc - Thanks for the input and data point!

You raise an interesting idea about just compressing the datafile. The primary issue I can think of, there, is that then we are almost forcing Reader / Writer applications to have de-compression libs built into them. If we do it at the archive level, it could be a more natural part of the workflow to do it prior to application loading. Either route carries some risks, though, in terms of application complexity.

I think @cityscapesc's data point is a good one, though, and backs up the feedback we got at GRCon: compression would be useful.

So, going back to the question of extension: I just pulled up the documentation on MIME Extensions, and as it turns out, the specification is that the longest pattern has the heaviest weight. From https://specifications.freedesktop.org/shared-mime-info-spec/shared-mime-info-spec-latest.html#idm140625828677088 :

> If several patterns of the same weight match then the longest pattern SHOULD be used. In particular, files with multiple extensions (such as Data.tar.gz) MUST match the longest sequence of extensions (eg '.tar.gz' in preference to '.gz').

My read of this is that we could safely do sigmf.gz, as long as we also ship a MIME Extension XML definition with SigMF. Thoughts?
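To illustrate the longest-pattern rule quoted above, here is a small sketch that resolves a filename against a table of glob patterns, all assumed to have the same weight. The two SigMF type names are invented for this example (no such MIME types are registered), and this is not how any real desktop environment implements the lookup.

```python
from fnmatch import fnmatch

# Hypothetical glob table; the two SigMF type names are made up for this example.
GLOBS = {
    "*.gz": "application/gzip",
    "*.tar.gz": "application/x-compressed-tar",
    "*.sigmf": "application/x-sigmf-archive",
    "*.sigmf.gz": "application/x-sigmf-archive+gzip",
}


def guess_type(filename, globs=GLOBS):
    """Return the type whose matching glob is longest, or None if nothing matches."""
    matches = [pattern for pattern in globs if fnmatch(filename, pattern)]
    if not matches:
        return None
    # Equal weights assumed, so the longest matching pattern wins.
    return globs[max(matches, key=len)]


print(guess_type("capture.sigmf.gz"))
print(guess_type("notes.txt.gz"))
```

Under this rule `capture.sigmf.gz` matches both `*.gz` and `*.sigmf.gz`, and the longer pattern is chosen, so a registered `*.sigmf.gz` glob would take precedence over the generic gzip handler.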

bhilburn · 2019-07-12T15:28:55Z

We've seen in the past couple of years that most people are compressing SigMF recordings - even when the gains from compressing the raw samples are low (as expected), in recordings with significant annotations, they can be meaningful.

In #99, I sort of sloppily added support for gzip. Realistically, supporting gzip and bzip2 is probably safe given their ubiquity. It's entirely possible people will want to use other forms of compression, but making those canonical would be problematic on some systems (thinking of 7zip, for example).

My inclination is to allow compression of archives using gzip and bzip2. I'm leaving this issue open for disagreeing opinions for a bit longer, in case anyone wants to comment.

n-west · 2020-04-24T15:06:22Z

So I have an alternate proposal. From the conversation I've picked up on the following desires:

use some compression to shrink the filesize because the data portion is large
be able to use tools to work with this compressed record

I think that the experience of using a library to work with a compressed tar would be fairly painful because it requires you to decompress the whole archive in order to read the metadata and provides 0 random-access to the data (so you have to decompress and untar the whole thing). This was also pointed out by @cityscapesc. I propose:

add an option around the datatype to compress just the data file portion of a record using compression formats that support pseudo-random access (see https://stackoverflow.com/questions/429987/compression-formats-with-good-support-for-random-access-within-archives)
leave the archive as is. If people want to compress the archive that would be OK too, but there's marginal benefit to it

This would allow a lightweight reading of the metadata, then randomly seeking to samples of interest without decompressing the entire archive. This means it's now possible to deal with very large recordings in a compressed manner without a very large memory requirement and no need to make a duplicate decompressed copy on disk. The expense is that of CPU power to decompress on the fly.
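One way to get that kind of pseudo-random access is to compress the data file in independent fixed-size chunks and keep an index of where each compressed chunk starts. The sketch below assumes zlib and an in-memory index; the chunk size, both function names, and the index layout are made up for illustration and do not correspond to any particular existing format.

```python
import zlib

CHUNK_BYTES = 64 * 1024  # Uncompressed bytes per independently compressed chunk.


def compress_chunked(raw, chunk_bytes=CHUNK_BYTES):
    """Compress raw bytes chunk by chunk; return (blob, list of chunk start offsets)."""
    blob = bytearray()
    index = []
    for start in range(0, len(raw), chunk_bytes):
        index.append(len(blob))
        blob += zlib.compress(raw[start:start + chunk_bytes])
    index.append(len(blob))  # Sentinel marking the end of the last chunk.
    return bytes(blob), index


def read_range(blob, index, offset, length, chunk_bytes=CHUNK_BYTES):
    """Return raw[offset:offset + length], decompressing only the overlapping chunks."""
    if length <= 0:
        return b""
    first = offset // chunk_bytes
    # Clamp to the last real chunk; the final index entry is only a sentinel.
    last = min((offset + length - 1) // chunk_bytes, len(index) - 2)
    pieces = [
        zlib.decompress(blob[index[i]:index[i + 1]])
        for i in range(first, last + 1)
    ]
    skip = offset - first * chunk_bytes
    return b"".join(pieces)[skip:skip + length]
```

Reading a few thousand samples from the middle of a large recording then touches only one or two chunks. The trade-off is a slightly worse compression ratio than compressing the whole file at once, since each chunk starts with an empty dictionary, plus the need to store the index somewhere.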

Lots of attention and care to selecting the right compression format is required, but I'm curious if this sounds like a reasonable approach to the folks with the use case for it.

bhilburn · 2021-05-26T14:58:49Z

@n-west - I like your suggestion a lot. Reading that thread, it actually sounds like both gzip and bzip2 are also capable of pseudo-random access to the compressed blocks, but it's not clear to me if those capabilities are natively available in the primary packages that are distributed.

I would like to have the feature described by Nathan 👆👆 in v1.0.0. Marking as such

jacobagilbert · 2021-07-29T12:49:38Z

I don't see this being ready for v1.0... moving to 2.x (can pull in to 1.1 or something sooner if we want to, this is a new feature and should not break anything)

dkozel · 2021-11-01T17:40:49Z

As a possibly related bit of info, there's IQZIP by the Libre Space Foundation which implements a CCSDS lossless compression standard. They did some interoperability work with SigMF and it looks like that was removed.
https://gitlab.com/librespacefoundation/sdrmakerspace/iqzip

bhilburn added enhancement suggestion labels Sep 20, 2017

bhilburn self-assigned this Sep 20, 2017

djanderson mentioned this issue Oct 27, 2017

Stream support #60

Closed

bhilburn mentioned this issue Oct 2, 2018

Multiple Recordings Extension #99

Closed

bhilburn added this to the Release v0.0.2 milestone Jul 12, 2019

bhilburn modified the milestones: Future (v2.x), Release v1.0.0 May 26, 2021

jacobagilbert assigned jacobagilbert and unassigned bhilburn May 27, 2021

jacobagilbert mentioned this issue May 27, 2021

Allow Freq (but not temporally) Bounded Annotations #128

Merged

jacobagilbert assigned n-west and unassigned jacobagilbert May 27, 2021

jacobagilbert modified the milestones: Release v1.0.0, Future (v2.x) Jul 29, 2021
