[CEP] Case snapshots #27315
Comments
This statement seems ambiguous. Would it be accurate to say "A rebuild of a case should refresh any snapshots newer than the date of the rebuild." What operations are involved when a snapshot is "refreshed"? Is it similar to form deprecation where the old snapshot is deprecated and the corresponding case transaction is revoked? Is it correct that the new "refreshed" snapshot will have a case transaction with the same
What would trigger this operation? F3 was edited? Why have multiple snapshots per case? Would it meet the objective to make a case snapshot operation that replaces forms older than a given data availability cutoff date with a snapshot form per updated case? In this scenario there would be at most one snapshot form per case, and it would always be the first/oldest form transaction associated with the case. Given a batch of forms to be snapshotted, once a snapshot form is created for each referenced case, the replaced forms can be safely deleted and/or moved elsewhere for further analysis. The snapshot process could be set up as a long-running resumable task that operates on batches of forms, where form ids in each batch are upserted into an The snapshot process could be run multiple times, each time with a newer data availability cutoff date. Multiple concurrent runs with different cutoff dates currently have an undefined outcome in my head, but it may be possible to make that safe if needed. One advantage of this scenario is that there is never a need to "refresh" a snapshot after it has been created. |
Updated
Since we'd need to maintain the date of the transaction I think this would either involve what you suggest, revoke + create new with correct date, or it would just overwrite the previous blob (2nd option seems a bit more dangerous and less atomic).
An edit or a form archive
In your scenario the only way to generate the snapshot would be to take the previous snapshot and replay all the new forms on top of it to get the new state. This would need to be done for every single case which is analogous to reprocessing every single form in the timeframe. The reason I didn't go with this option is because of the vast amounts of processing required to do it. |
Ah, I see. I thought that was also implied in this sentence from your proposal:
But I see now that the case blocks for the snapshot can be generated directly from the case. Thanks for hearing me out and clearing that up. |
Note: I'm going to try and update this with some of the details about form archiving since I think it makes sense to think about them together. |
Updated:
|
It seems desirable to design the system such that it cannot get into a state where a form can be viewed (in a way that a user may consider performing an edit / archive / unarchive operation on it) that is associated with a case that does not meet the ‘case rebuild check rules’. In other words, it should be impossible to find any form in the system that cannot be edited, for example. Can we achieve this design goal? If not, why? Use of the word "archive" is potentially problematic because it conflicts with the current "form archive" procedure, which as far as I can tell is completely unrelated to this new kind of form archival. Example uses in this CEP:
Consider adopting a different term for what happens to forms beyond the data availability cutoff date. |
We could get around this if we have Few questions
|
100% I was looking for alternatives but didn't come up with anything. Got any suggestions?
Yea, I'm not sure about this - I think this needs to be balanced with making data available to the users. I think there are a lot of details that need to be considered before we can actually remove any form data. The focus of this CEP is the case snapshots. |
As mentioned in above comment I think there are still a lot of things that aren't addressed here with regard to removing form data which is not the focus of the CEP. As far as the snapshots showing up in case history - I think we could exclude them from reports etc.
"The creation of a case snapshot should be triggered by a form transaction for the case". i.e. it is triggered during form submission. I think doing it synchronously would make the most sense.
There is no process that's removing form data as yet. I think there's still a lot of work to do before we can start that. This CEP is one of the pieces. |
What about inactive cases? Wouldn't they also need snapshots if their forms are to be archived?
Imagine a case with some updates, including some snapshots, but where the most recent snapshot is older than the data availability cutoff
It seems a boundary condition could occur where a user submits data a day too early to trigger a new snapshot, then their supervisor wants to make a change to that submission. This case won't be available for edit or archive until the next form submission triggers a snapshot, even though the most recent form submission was only a day or so ago. Is my understanding of this correct? I was assuming that the max snapshot age would be the same as the data availability window, but looking back over your original comment, I see that's not actually stated. In any case, it sounds like the interplay between the two would be crucial in determining what window of data is guaranteed to be fully available. One approach to this would be to require that max snapshot age be less than half of the data availability window, thereby ensuring that any activity within the max snapshot age is fully available for edit. Or introducing a third concept of data editability window, which is defined as
Alternatively, this data expiration could be approached not as a rolling window, but as a series of horizons. For example, you have full access to data from the past 3 months, but on June 1st, you lose access to data not modified since February, and on July 1st, March, and so on (snapshots would have to be made during the first update in each new month). This sounds harder to work with in code, but might be easier to explain to partners and implementers.
Seems like we'd want fast and cheap access in queries to whether a case can be edited or archived. For instance, perhaps a
I've heard the term "freezing/frozen" used for this sort of thing elsewhere. Eg AWS Glacier, ES frozen indices. |
+1
If a case doesn't get modified then successive snapshots would be identical to each other and don't add any benefit. In terms of max snapshot age and data availability cutoff I had thought they would be of this order of magnitude.
The 'freezing' of form data doesn't have to match up exactly with the cutoff. I think it may make sense to have the actual 'freezing' process lag the cutoff by max snapshot age to maximize the likelihood of a snapshot being available. Using a series of horizons is also likely to be how it gets implemented, particularly if we use rollover indexes or similar mechanisms to actually do the 'freezing'.
Editing and archiving aren't very common operations, and the rules for allowing them based on the case snapshots are very dependent on the date of the form, so I don't think it will work to have a case property storing the date of the last snapshot (though that will likely be useful for the snapshot process itself). When displaying a form, a simple query to the cases would allow us to know if it is editable / archivable:
select 1 from case_transaction
where case_id in (case_ids) and type & $SNAPSHOT = $SNAPSHOT
and server_date < $form_received_on and server_date > $data_availability_cutoff |
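For readers who prefer to see the check outside SQL, here is a minimal in-memory sketch of the same condition in Python. It is only an illustration: the bit values for `FORM` and `SNAPSHOT`, the function name `is_form_editable`, and the `(type, server_date)` tuple shape are all assumptions made for the example, not part of the proposal or of CommCare's code.

```python
from datetime import datetime

# Hypothetical bit values; the thread only says a snapshot bit exists.
FORM = 1
SNAPSHOT = 64


def is_form_editable(transactions, form_received_on, data_availability_cutoff):
    """Mirror the query above for a single case.

    `transactions` is an iterable of (type, server_date) tuples. The form is
    treated as editable when at least one snapshot transaction falls strictly
    between the data availability cutoff and the form's received_on date.
    """
    return any(
        (tx_type & SNAPSHOT) == SNAPSHOT
        and data_availability_cutoff < server_date < form_received_on
        for tx_type, server_date in transactions
    )


example_transactions = [
    (FORM, datetime(2020, 1, 1)),
    (FORM | SNAPSHOT, datetime(2020, 3, 1)),
    (FORM, datetime(2020, 4, 1)),
]
```

With a cutoff of 1 February 2020, the form received on 1 April has a snapshot between it and the cutoff, whereas a form received on 15 February does not.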
Abstract
To allow old form data to be archived from the system without impacting the integrity of the case data model.
Cases are built by successively applying the case transactions extracted from form data. Without ready access to ALL the forms required to make up a case, no case transactions can be rolled back or archived, since the case cannot be correctly rebuilt if there are missing forms.
This CEP proposes the use of periodic case snapshots which allow form data beyond a certain point in time to be removed without impacting the integrity of the case after that point in time.
Motivation
For long-running projects, having access to raw form data becomes less useful as the data ages. Conversely, the cost of keeping the data remains the same since CommCare requires ‘fast’ access to the form data for certain workflows, e.g. case rebuilds due to form archiving.
Without a way to isolate the case data from the full history of the case there is no way to safely remove form data from the system.
Specification
Glossary
Snapshots should only be created for the current state of a case (it would be too costly to generate historical snapshots for all cases). A snapshot must include all the relevant fields necessary to reset the case to the exact state at the time of the snapshot. This should also include the state of any ledgers associated with the case.
The implication of this is that this feature will not be immediately useful since snapshots are generally only useful further back in the timeline. In terms of meeting the goals of this CEP, archival of form data will only be possible once there are historical case snapshots for all active cases up to the data availability cutoff.
The creation of a case snapshot should be triggered by a form transaction for the case. A case that is not receiving new form transactions should not have a snapshot generated even if its last snapshot is older than ‘max snapshot age’.
A snapshot should be created for a case that meets the following criteria:
OR
A rebuild of a case should refresh any snapshots newer than the date of the rebuild. Refreshing a snapshot can be done as follows:
Notation:
C = case
FX = form X
FXx = edited form X
SX = snapshot form X
SXx = updated snapshot X
C = F1, F2, S1, F3, S2, F4
Rebuild prior to S1 not permitted since F1 and F2 are beyond the data availability cutoff point.
Rebuild from F3:
S2 included in this list since it will need to be updated during the rebuilding process
C = F1, F2, S1, F3a, S2a, F4
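The rebuild-and-refresh flow in the example above can be sketched roughly as follows. This is a toy model, not the proposed implementation: the timeline is a plain list of dicts, a form carries a flat `updates` dict, a snapshot carries a full `state` dict, and the names `rebuild_from` and `example_timeline` are invented for illustration. It also assumes that forms before the first snapshot are unavailable, so a rebuild with no earlier snapshot is rejected.

```python
def rebuild_from(timeline, edited_index):
    """Rebuild a case after the form at `edited_index` has been edited.

    Starts from the nearest snapshot before the edited form, replays every
    later form, and refreshes any later snapshot with the recomputed state.
    Returns the final case state.
    """
    start = max(
        (i for i in range(edited_index) if timeline[i]["kind"] == "snapshot"),
        default=None,
    )
    if start is None:
        # Assumption: earlier forms are beyond the data availability cutoff.
        raise ValueError("no snapshot before the edited form; rebuild not permitted")

    state = dict(timeline[start]["state"])
    for entry in timeline[start + 1:]:
        if entry["kind"] == "form":
            state.update(entry["updates"])
        else:
            # A snapshot newer than the rebuild point is refreshed in place.
            entry["state"] = dict(state)
    return state


def example_timeline():
    """C = F1, F2, S1, F3, S2, F4 with made-up case properties."""
    return [
        {"kind": "form", "name": "F1", "updates": {"a": 1}},
        {"kind": "form", "name": "F2", "updates": {"b": 2}},
        {"kind": "snapshot", "name": "S1", "state": {"a": 1, "b": 2}},
        {"kind": "form", "name": "F3", "updates": {"b": 3}},
        {"kind": "snapshot", "name": "S2", "state": {"a": 1, "b": 3}},
        {"kind": "form", "name": "F4", "updates": {"c": 4}},
    ]
```

Editing F3 (index 3) and calling `rebuild_from(timeline, 3)` replays from S1 and rewrites S2, while attempting the same for F2 (index 1) raises, matching the rule that a rebuild prior to S1 is not permitted.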
Removal of form data (archiving)
received_on date
Case rebuild check rules
A case can be rebuilt if:
Storage
A case snapshot should be stored as a single form with all the necessary case blocks required to restore the case into the exact state it was when the snapshot was created. This may require some new case block primitives in order to allow setting certain metadata fields on the case.
The form will have the following XMLNS:
http://commcarehq.org/case/snapshot
As with normal forms each snapshot will be recorded in the OLTP form table as well as in the OLTP case transaction table. The case transaction will be of type form with an additional type bit set to indicate that it is a snapshot and allow easy filtering.
transaction.type = FORM | SNAPSHOT
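As a rough illustration of the type-bit idea, the sketch below uses Python's `enum.IntFlag`. The numeric values and the names `TransactionType` and `is_snapshot` are assumptions for the example; the CEP does not say which bit would be used.

```python
from enum import IntFlag


class TransactionType(IntFlag):
    # Hypothetical bit assignments, chosen only for this example.
    FORM = 1
    SNAPSHOT = 64


def is_snapshot(tx_type):
    """True when the snapshot bit is set, whatever other bits are present."""
    return (tx_type & TransactionType.SNAPSHOT) == TransactionType.SNAPSHOT


snapshot_type = TransactionType.FORM | TransactionType.SNAPSHOT
```

Because the snapshot transaction still carries the `FORM` bit, existing filters on form transactions would keep matching it, and the extra bit allows snapshots to be selected or excluded with a single mask.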
Impact on users
Impact on hosting
This will increase the storage requirements for hosting CommCare by a small margin.
Backwards compatibility
NA
Release Timeline
NA
Open questions and issues
Should we try to avoid the situation where a case that only gets sporadic updates ends up with as many snapshots as forms? This could happen if a case only gets updated once per month and “max_snapshot_age = 1 month”.
C = F1 S1 F2 S2 F3 S3 …
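A small simulation can make this concern concrete. It assumes a simplified trigger rule that is not spelled out in this text: a form submission creates a snapshot when the case has no snapshot yet or when the last snapshot is at least `max_snapshot_age` old. The function name `simulate` and the dates are made up for the example.

```python
from datetime import date, timedelta


def simulate(form_dates, max_snapshot_age):
    """Return the resulting timeline labels (F1, S1, ...) for a case.

    Assumed rule: a snapshot follows a form when there is no previous
    snapshot or the previous one is at least `max_snapshot_age` old.
    """
    timeline = []
    last_snapshot = None
    snapshot_count = 0
    for n, received_on in enumerate(form_dates, start=1):
        timeline.append(f"F{n}")
        if last_snapshot is None or received_on - last_snapshot >= max_snapshot_age:
            snapshot_count += 1
            last_snapshot = received_on
            timeline.append(f"S{snapshot_count}")
    return timeline


# One update roughly every month versus one update every week.
monthly = [date(2020, 1, 1) + timedelta(days=31 * i) for i in range(3)]
weekly = [date(2020, 1, 1) + timedelta(days=7 * i) for i in range(6)]
```

Under these assumptions the monthly case gets a snapshot after every form, while the weekly case gets far fewer snapshots than forms, which is the imbalance the question is asking about.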