Inconsistency risk in CSI Snapshots #789

Open
jerome-jutteau opened this issue Oct 4, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@jerome-jutteau
Contributor

jerome-jutteau commented Oct 4, 2023

/kind bug

What happened?

During CreateSnapshot, the CSI driver returns OK as soon as its call to CreateSnapshot (IaaS) has returned.
Once CreateSnapshot (CSI) has returned OK, the CO considers the Snapshot to be "cut" in the sense of the CSI specification (meaning the Snapshot's content cannot be altered by future writes).
Once the "cut" is done, the CO may "thaw" the application, which may then continue writing to the Volume.

However, unlike the EC2 behavior where "the point-in-time snapshot is created immediately", Outscale's Snapshot is only cut once the "completed" state is reached on the IaaS side:

The data contained in a snapshot is considered cut when the snapshot is in the completed state.

This behavior could lead the CO to resume writes on the Volume prematurely and thereby alter the Snapshot content.
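
To make the race concrete, here is a toy timeline model. It is only an illustration under a simplifying assumption (every write made before the cut ends up in the Snapshot); the timestamps are invented and do not come from a real run.

```python
def snapshot_contents(write_times, cut_time):
    """Return the writes captured by a snapshot whose content is frozen at cut_time."""
    return [t for t in write_times if t <= cut_time]


# Hypothetical timeline in seconds (illustrative values only).
write_times = list(range(10))  # the application writes once per second
csi_return_time = 3            # CreateSnapshot (CSI) returns OK, the CO thaws the application
iaas_completed_time = 6        # the Snapshot (IaaS) reaches the "completed" state

# What the CO believes is in the Snapshot vs. what may actually be in it.
expected = snapshot_contents(write_times, csi_return_time)
actual = snapshot_contents(write_times, iaas_completed_time)
leaked = [t for t in actual if t > csi_return_time]

print("writes the CO expects in the snapshot:", expected)
print("writes that may end up in the snapshot:", actual)
print("writes made after the thaw that leaked in:", leaked)
```

In this model, the writes at t=4, 5 and 6 were made after the CO resumed the application, yet they can still land in the Snapshot.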

What you expected to happen?

As described in CSI spec:

CreateSnapshot is a synchronous call and it MUST block until the snapshot is cut

With the current Outscale API version, CreateSnapshot (CSI) should block until the Snapshot (IaaS) state reaches "completed".

How to reproduce it (as minimally and precisely as possible)?

  1. Create a loop that appends the current date to date.txt in a Volume every second
  2. Trigger CreateSnapshot (CSI)
  3. Read creation_time of the Snapshot
  4. Restore the Snapshot to a new Volume and read the date file
  5. Compare the dates from 3. and 4. => when the bug reproduces, the restored Volume contains dates later than creation_time
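
Steps 1 and 5 could be scripted roughly as below. This is a sketch, not a tested reproduction: `append_dates` and `dates_after` are hypothetical helper names, the file paths are up to you, and steps 2 to 4 (triggering and restoring the Snapshot) still have to be done through the CO.

```python
import os
import time
from datetime import datetime, timezone


def append_dates(path, count, interval=1.0):
    """Step 1: append the current UTC date to `path`, once per `interval` seconds."""
    for _ in range(count):
        with open(path, "a", encoding="utf-8") as f:
            f.write(datetime.now(timezone.utc).isoformat() + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the write down to the Volume
        time.sleep(interval)


def dates_after(path, creation_time):
    """Step 5: return the dates in the restored file that are later than creation_time."""
    with open(path, encoding="utf-8") as f:
        dates = [datetime.fromisoformat(line.strip()) for line in f if line.strip()]
    return [d for d in dates if d > creation_time]
```

Run `append_dates` against date.txt on the source Volume while the Snapshot is taken, then call `dates_after` on date.txt from the restored Volume with the Snapshot's creation_time (as a timezone-aware datetime). A non-empty result means writes made after creation_time ended up in the Snapshot.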

Anything else we need to know?:

Note that ready_to_use still switches to true once the Snapshot (IaaS) moves to the "completed" state, as Outscale has no post-processing step (unlike EC2).

🔥IMPLEMENTATION RISK🔥

Waiting for the state to reach "completed" could easily make CSI calls time out, which is acceptable as the CO will call CreateSnapshot again and again.
If a pending call is not stopped once its timeout is reached, it may keep performing ReadSnapshots (IaaS) in an infinite loop, which causes these issues:

  • Runners are kept busy running the same ReadSnapshots (IaaS) call over and over, leading to useless API usage.
  • All runners may be saturated by the same task, so the controller can no longer respond, leading to denial of service.

The fix should consider exiting with an error instead of calling ReadSnapshots (IaaS) forever (this could be after a fixed allocated time, after the first read, ...)
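
A minimal sketch of a bounded wait, to illustrate the idea rather than the driver's actual code. `read_state` is a hypothetical callable standing in for ReadSnapshots (IaaS), `SnapshotNotReady` is an invented exception name, and the `clock` and `sleep` parameters exist only so the loop can be exercised without real delays.

```python
import time


class SnapshotNotReady(Exception):
    """Raised when the snapshot is not "completed" within the allotted time."""


def wait_until_completed(read_state, timeout, poll_interval=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll read_state() until it returns "completed" or `timeout` seconds elapse.

    read_state is a hypothetical stand-in for a ReadSnapshots (IaaS) call
    that returns the snapshot state as a string.
    """
    deadline = clock() + timeout
    while True:
        if read_state() == "completed":
            return
        remaining = deadline - clock()
        if remaining <= 0:
            # Give up with an error so the runner is freed; the CO will retry.
            raise SnapshotNotReady("snapshot not completed before the deadline")
        # Never sleep past the deadline.
        sleep(min(poll_interval, remaining))
```

Because the CO calls CreateSnapshot again after an error, each retry only needs to resume waiting on the already-created Snapshot (IaaS), so no call has to outlive its own deadline.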

Environment

  • Driver version: <= 1.2.4
@jerome-jutteau jerome-jutteau added the bug Something isn't working label Oct 4, 2023
@jerome-jutteau jerome-jutteau assigned ghost Oct 4, 2023
@albundy83
Contributor

Hello,

I'm looking to implement the new feature used here https://cloudnative-pg.io/documentation/current/backup_volumesnapshot/
Do you think we could be impacted by the bug you described?

@jerome-jutteau
Contributor Author

Hi @albundy83,

Yes, this theoretical bug could affect your implementation if it is based on Outscale's CSI snapshots.
We need to investigate this issue further. cc @outscale-hmi
