[CEP] Form XML Compression in BlobDB #26813
@satyaakam FYI
This proposal seems like a good idea to me.
Will doubling the number of writes (save each uncompressed XML blob, and then later save a compressed copy of the same) have any negative effects on the system? i.e., storage pricing, extra network bandwidth requirements, etc.
This is my preferred option due to the simplicity of implementation and reasoning about how the system works.
I assume we can measure the impact, but I would expect the cost of decompression to be negligible, possibly even less than the cost of transferring a larger object over the wire.

Aside: 2KB/form is pretty small. I wonder if we could easily measure the difference in cost (storage fees and operational performance) between storing XML in a system like S3 vs storing it in PostgreSQL? Hosting fees at ICDS scale may be prohibitively expensive for Postgres, but I imagine there is a point (smaller scale) at which the cost of maintaining a separate storage backend is more expensive and/or complex (a different type of expense), especially for hosting environments that run a local S3-compatible storage service.
@millerdev thanks for your input. I had not factored in that decompression is much faster and simpler than compression. I think you are correct that the compression should instead happen during the initial request processing, which makes things much simpler.
Thanks for looping me in, @snopoke. Two thoughts on this approach:
Storing the notion of whether a blob is compressed has a few other advantages, including that it gives us an easier index to start processing uncompressed blobs to recover storage. Long-term / moonshot, it also gives us a place to store more complex info if we ever want to really get fancy with this. Daniel and I talked a long time ago about concepts like cold storage, where we take forms which are effectively archived and collect up, say, 10-100 forms of the same type and compress all of them into one GZIP'd blob (which could get us another 10x compression ratio), and then store an index that says (Blob XYZ, Format: GZIP'd collection, item: ABC.xml). That's a lot easier if we build around the notion that the metadata record is a rich index to an abstract storage, rather than just a pointer.
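For illustration only, a rough sketch of that cold-storage idea: bundle several same-type forms into one gzip'd archive with a per-item index. Names and structure here are hypothetical and not part of this CEP.

import io
import tarfile


def bundle_forms(form_xml_by_id):
    """Pack several form XML payloads into one gzip-compressed tar blob.

    Returns the blob bytes plus an index mapping form id -> member name,
    which could live on a richer metadata record.
    """
    buffer = io.BytesIO()
    index = {}
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for form_id, xml_bytes in form_xml_by_id.items():
            member = f"{form_id}.xml"
            info = tarfile.TarInfo(name=member)
            info.size = len(xml_bytes)
            archive.addfile(info, io.BytesIO(xml_bytes))
            index[form_id] = member
    return buffer.getvalue(), index


def read_form(blob_bytes, member):
    """Pull a single form's XML back out of a bundled blob."""
    with tarfile.open(fileobj=io.BytesIO(blob_bytes), mode="r:gz") as archive:
        return archive.extractfile(member).read()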
@snopoke Did a quick (very low-N) test on the idea of compressing forms together in batches of the same type. Ran these on growth monitoring forms, which are smaller and compress less well, which is why the starting average is lower. Leaving the numbers here in case they are useful to get an idea of the trend.
Just to confirm here: you are suggesting we compress ALL data in the BlobDB as a matter of course, or based on some threshold? The main reason I didn't take this route is that we are only talking about form XML, which is one of 17 different types of data we currently store. Granted, it is the most common type by a large factor. We already decide explicitly to store compressed data in many places in HQ code, e.g. data exports, so this would be in keeping with that approach, where it's not the BlobDB's prerogative to decide when compression is necessary. Changing that design is a much larger shift which would require changes to the core BlobDB API.
I did consider this, but since the current plan is to do the compression outside of the BlobDB it didn't make sense. If we were to use the metadata to determine whether a blob is compressed, we'd also have to change the BlobDB API, which currently does not read the metadata when fetching blobs (since it only needs the key).
Since this proposal is only considering form XML the decompression would only be applied to code that is fetching those objects and not anything else. It is also not being applied inside the BlobDB.
Yes, this is true, though that could also be accomplished by simple date-based filtering. This field would lose value once all XML is compressed (but I guess we could drop it after that). There is already a metadata field in the BlobDB model, but it is an un-indexed text / JSON field so we couldn't use it for this purpose anyway. If we do go ahead and put this in the BlobDB itself then having the metadata field would be essential.
I don't think this has any bearing on the current proposal. Regardless of how the XML compression is done, it would still be possible to do the batching at a later stage. The info required to keep track of batched blobs is very different from a single 'is_compressed' field, so I don't think it's much of a consideration unless we come up with a more flexible concept of blob metadata. My main concern with that is it's over-engineering at this stage. @millerdev I'd be interested to hear your thoughts on whether this should be implemented as part of the core BlobDB API or kept separate.
@snopoke Thanks for the clarifications. I think my mental model of the blobdb's role is more abstraction based (i.e. I would have assumed that it was a transparent bits-in-bits-out interface which makes decisions about things like compression in the blobdb layer) than the current implementation. Agreed that my changes would be a bigger shift given that isn't how it's approached currently.
It does seem advantageous to use a database field to indicate that a blob is compressed rather than inspecting the content for magic bits. The latter seems easier to exploit in surprising ways. Blob metadata is a perfect place to store that information. On the other hand, as you point out, blob metadata is not always loaded prior to loading the blob itself, since currently only a "key" string is needed to load a blob using the blob db API. So we cannot store the compression flag on the metadata record for any blobs that are loaded by key alone. On the other other hand, SQL form attachments (including form XML) are always loaded via their metadata (I think I'm right about that, but feel free to prove me wrong). We could require blob types that may be compressed (only form XML for now) to always be loaded via their metadata. I'm leaning toward +1 on adding a compression flag on the blob metadata, even if it is only ever used for form XML.
Since compression is more expensive than decompression, would we consider the option of mobile itself sending gzip-compressed form XML? That would mean we only need to decompress when reading, and we can store the data right away without the overhead of compressing.
@sravfeyn long term that could definitely be viable, but the lead/lag time for those kinds of performance improvements to make it back upstream to HQ for the big use cases is really, really long. It is also a little tricky to figure out how to establish compatibility without sending up multiple copies of the form (one compressed, one not), so I think it's a pretty high-LOE change for the value it'll create.
I've updated the CEP to reflect the comments and discussion.
This is shaping up well. Regarding the get API, one option would be:
def get(self, key=None, type_code=None, *, meta=None):
    key = validate_get_args(key, type_code, meta)
    blobstream = ...
    if meta and meta.compressed:
        return GzipFile(fileobj=blobstream, mode='r')
    return blobstream


def validate_get_args(key, type_code, meta):
    if key is not None or type_code is not None:
        if meta is not None:
            raise ValueError("'key' and 'meta' are mutually exclusive")
        if type_code == CODES.form_xml:
            raise ValueError("form XML must be loaded with 'meta' argument")
        if key is None or type_code is None:
            raise ValueError("'key' must be specified with 'type_code'")
        return key
    if meta is None:
        raise ValueError("'key' and 'type_code' or 'meta' is required")
    return meta.key

Edit: update order of checks in validate_get_args.
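Assuming a signature along those lines, caller-side code for form XML might look like this (the metadata lookup is elided and the surrounding helpers are illustrative, not an existing API):

meta = ...  # BlobMeta record for the form XML, looked up from the metadata DB
with blob_db.get(meta=meta) as xml_stream:
    # the stream is transparently decompressed when meta.compressed is set
    xml = xml_stream.read()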
Abstract
Form XML makes up 70% of the BlobDB data on ICDS. In environments with fewer form attachments this number is likely much higher. This CEP proposes to compress the XML data to reduce the storage requirements of the BlobDB.
Motivation
The form XML is saved to the BlobDB uncompressed and, as stated above, it makes up at least 70% of the data in the BlobDB. Basic tests show that compression can give a considerable reduction in storage.
On a sample of 10000 ICDS forms:
Specification
Since this is an isolated use case the compression and decompression will not be handled by the BlobDB.
This specification allows the compression of the XML to take place in three different places. The purpose of this is to allow experimentation, based on data from actual usage, to determine where compression works best (see the Metrics section below).
Compression
The compression should use gzip since it offers a good compromise between compression ratio and speed. It is also well supported in Python.
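A minimal sketch of the round trip using the standard library (the compression level shown is an assumption, not part of the proposal):

import gzip


def compress_xml(xml_bytes):
    # level 6 trades a little compression ratio for speed
    return gzip.compress(xml_bytes, compresslevel=6)


def decompress_xml(blob_bytes):
    return gzip.decompress(blob_bytes)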
Metrics
The following metrics should be added to allow monitoring of this change both to ensure it is correctly functioning and to provide data to better understand the access patterns.
commcare.form_submission.read_compressed
This metric will give us insight into how many times a compressed XML is read and where (which process) it is read.
Tags:
- view name for requests (get_current_request().resolver_match.view_name)
- task name for background tasks (get_current_task().name)
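A sketch of how the counter might be emitted, assuming a counter helper such as corehq.util.metrics.metrics_counter (the helper name and tag key are assumptions):

from corehq.util.metrics import metrics_counter  # assumed helper


def report_compressed_read(name):
    """Record a read of compressed form XML.

    `name` is the view name (get_current_request().resolver_match.view_name)
    or the task name (get_current_task().name), per the tags above.
    """
    metrics_counter("commcare.form_submission.read_compressed", tags={"name": name})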
Changes to CommCare
This should be implemented within the BlobDB in order to make it transparent to calling code.
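For example, the new flag on the blob metadata model might look roughly like this (model name and field details are assumptions; whether to index it is the open question below):

from django.db import models


class BlobMeta(models.Model):
    ...  # existing fields unchanged
    # True when the stored blob content is gzip-compressed
    compressed = models.BooleanField(default=False)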
Question: should this be indexed?
put API to compress data based on the type code
Since we only want to compress form XML at this stage, the decision to compress can be based on the type code. Any calls to blobdb.put with a type code of CODES.form_xml (2) should compress the content before saving it to the backend.
In addition to compressing the data, the blob metadata must record that the data was compressed by setting the compressed flag to True:
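A minimal sketch of the write path, as a helper the put implementation could call (the helper name, internals, and import path are assumptions):

import gzip
import io

from corehq.blobs import CODES  # import path assumed


def maybe_compress(content, meta):
    """Compress form XML before it is written to the backend.

    `content` is the file-like object passed to put(); `meta` is the new
    blob metadata record. Returns the stream to hand to the backend.
    """
    if meta.type_code == CODES.form_xml:
        meta.compressed = True
        return io.BytesIO(gzip.compress(content.read()))
    return content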
get API to support decompressing
Currently the get API only takes the blob key, which is used to fetch the blob directly from the backend. In order to support decompression the API will need to be modified to pass in the meta object as well:
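For example, along the lines of the snippet from the discussion above (backend fetch elided):

from gzip import GzipFile


def get(self, key=None, type_code=None, *, meta=None):
    key = validate_get_args(key, type_code, meta)
    blobstream = ...  # fetch from the backend by key, as today
    if meta is not None and meta.compressed:
        # wrap the stream so callers receive decompressed XML
        return GzipFile(fileobj=blobstream, mode='r')
    return blobstream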
Impact on users
No impact on end users is expected.
Impact on hosting
No impact is expected for hosting CommCare. This change should be completely transparent.
Backwards compatibility
Since we will likely not compress old data (at least not initially) it is a requirement that the code support compressed and uncompressed XML. This requirement will make the change backwards compatible by default.
Release Timeline
End of Q2 2020
Open questions and issues
Should metadb.content_length be the compressed or the uncompressed size?