Inconsistency risk in CSI Snapshots #789

jerome-jutteau · 2023-10-04T12:36:17Z

/kind bug

What happened?

During CreateSnapshot, the CSI driver returns OK right after calling CreateSnapshot (IaaS).
Once CreateSnapshot (CSI) has returned OK, the CO considers the Snapshot to be "cut" in the sense of the CSI specification (meaning the Snapshot's content cannot be altered by future writes).
Once the "cut" is done, the CO may "thaw" the application, which may then continue writing to the Volume.

However, unlike the EC2 behavior where "the point-in-time snapshot is created immediately", Outscale's Snapshot is only cut once the "completed" state is reached on the IaaS side:

The data contained in a snapshot is considered cut when the snapshot is in the completed state.

This behavior could lead the CO to resume writes on the Volume prematurely and alter the Snapshot content.

What you expected to happen?

As described in CSI spec:

CreateSnapshot is a synchronous call and it MUST block until the snapshot is cut

With the current Outscale API version, CreateSnapshot (CSI) should block until the Snapshot (IaaS) state reaches "completed".

How to reproduce it (as minimally and precisely as possible)?

1. Create a loop that appends the current date to date.txt in a Volume every second
2. Trigger CreateSnapshot (CSI)
3. Read creation_time of the Snapshot
4. Restore the Snapshot to a new Volume and read the date file
5. Compare the dates from steps 3 and 4 => when the issue is reproduced, the restored Volume should contain dates written after creation_time

Anything else we need to know?:

Note that ready_to_use still switches to true once a Snapshot (IaaS) moves to the "completed" state, as Outscale has no post-processing step (unlike EC2).

🔥IMPLEMENTATION RISK🔥

Waiting for the state to reach "completed" could easily time out CSI calls, which is OK as the CO will call CreateSnapshot again and again.
If each pending call is not stopped once its timeout is reached, each call may keep performing ReadSnapshots (IaaS) in an infinite loop and cause the following issues:

Runners are kept busy running the same ReadSnapshots (IaaS) over and over, leading to useless API usage.
All runners may be saturated by the same task, so the controller can no longer respond, leading to denial of service.

The fix should consider exiting with an error instead of calling ReadSnapshots (IaaS) forever (this could be after a fixed allocated time, after the first read, ...).

Environment

Driver version: <= 1.2.4


albundy83 · 2024-03-25T07:30:20Z

Hello,

I'm looking to implement the new feature used here https://cloudnative-pg.io/documentation/current/backup_volumesnapshot/
Do you think we could be impacted by the bug you described?

jerome-jutteau · 2024-04-15T09:08:15Z

Hi @albundy83,

Yes, this theoretical bug could affect your implementation if it is based on Outscale's CSI snapshots.
We need to investigate more around this issue. cc @outscale-hmi

jerome-jutteau added the "bug" (Something isn't working) label Oct 4, 2023

jerome-jutteau assigned ghost Oct 4, 2023
