net: remember the name of the lock chain (nftables) #2550

Open · wants to merge 1 commit into criu-dev

Conversation

@adrianreber (Member)

Using libnftables, the name of the chain used to lock the network is composed from ("CRIU-%d", real_pid). This leads to around 40 zdtm tests failing with errors like this:

Error: No such file or directory; did you mean table 'CRIU-62' in family inet?
delete table inet CRIU-86

The reason is that as soon as a process is running in a namespace, the real PID can be anything and only the PID in the namespace is restored correctly. Relying on the real PID does not work for the chain name.

Using the PID of the innermost namespace would lead to the chain being called 'CRIU-1' most of the time, which is also not really unique.

The uniqueness of the name was always problematic. With this change, all tests that rely on network locking work again when the nftables backend is used.
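
To illustrate the failure mode, here is a minimal standalone sketch (the PIDs 62 and 86 are taken from the error output above; only the "inet CRIU-%d" pattern mirrors criu/netfilter.c, everything else is made up for illustration):

#include <stdio.h>

/* Illustration only: the table name is derived from the real PID, which
 * differs between dump and restore for a process living in a PID namespace. */
static void make_table_name(char *buf, size_t n, int real_pid)
{
	snprintf(buf, n, "inet CRIU-%d", real_pid);
}

int main(void)
{
	char dump_name[32], restore_name[32];

	make_table_name(dump_name, sizeof(dump_name), 62);       /* real PID at dump time */
	make_table_name(restore_name, sizeof(restore_name), 86); /* real PID at restore time */

	/* dump created "inet CRIU-62", but restore tries to delete "inet CRIU-86" */
	printf("dump created:    %s\n", dump_name);
	printf("restore deletes: %s\n", restore_name);
	return 0;
}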

@adrianreber force-pushed the 2024-12-17-nftables-lock-name branch from c483710 to 0305093 on December 17, 2024 13:46
@avagin requested a review from mihalicyn on December 17, 2024 16:25
@mihalicyn (Member)

Hi Adrian!

Can you tell me how and under which circumstances you caught this issue?
It is not caught by our GitHub CI tests, right?

As far as I understand, the idea of your fix is to keep the nftables table name in the inventory image file instead of dynamically recalculating it on restore (using root_item->pid->real). Am I right?

The first question I have after going through this is "How did this work before?".

P.S. I'll take a closer look at this. I haven't spent enough time yet to fully understand what's going on there.

@adrianreber (Member Author)

The first question I have after going through this is "How did this work before?".

It probably never did. We are not running all of the tests on a system without iptables using the nftables locking backend. Only two or four tests run with the nftables backend.

@adrianreber (Member Author)

Can you tell me how and under which circumstances you caught this issue?

I am trying to switch the default locking backend in Fedora and CentOS >= 10 from iptables to nftables because iptables is no longer installed by default.

As far as I understand, the idea of your fix is to keep the nftables table name in the inventory image file instead of dynamically recalculating it on restore (using root_item->pid->real). Am I right?

Yes. The table name makes sense if locking and unlocking happen in the same CRIU run, but across CRIU runs the existing approach does not work.

@mihalicyn (Member)

Ah, thanks for the clarifications!

I wonder if we can do something like this:

$ git diff
diff --git a/criu/netfilter.c b/criu/netfilter.c
index 9e78dc4b0..c558f9bf1 100644
--- a/criu/netfilter.c
+++ b/criu/netfilter.c
@@ -299,7 +299,7 @@ int nftables_lock_connection(struct inet_sk_desc *sk)
 
 int nftables_get_table(char *table, int n)
 {
-       if (snprintf(table, n, "inet CRIU-%d", root_item->pid->real) < 0) {
+       if (snprintf(table, n, "inet CRIU-%d", root_item->ids->pid_ns_id) < 0) {
                pr_err("Cannot generate CRIU's nftables table name\n");
                return -1;
        }

Yes, it's not a forward-compatible change and will break restore of images that were dumped with an older CRIU. In this form it only works for experimental purposes (and we have to check for root_item->ids->has_pid_ns_id too). But I'm curious if it helps.
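
A rough sketch of what that guarded variant could look like (not the actual patch; it only adds the has_pid_ns_id check to the function from the diff above):

int nftables_get_table(char *table, int n)
{
	/* Sketch only: refuse to build a name if the image carries no pid_ns_id. */
	if (!root_item->ids || !root_item->ids->has_pid_ns_id) {
		pr_err("No pid_ns_id available for CRIU's nftables table name\n");
		return -1;
	}

	if (snprintf(table, n, "inet CRIU-%u", root_item->ids->pid_ns_id) < 0) {
		pr_err("Cannot generate CRIU's nftables table name\n");
		return -1;
	}

	return 0;
}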

@mihalicyn (Member) commented Dec 18, 2024

My idea is that instead of introducing a new field nft_lock_table in the inventory_entry just for a single purpose, we can use inventory_entry->root_ids->pid_ns_id or root_item->ids->pid_ns_id as a source of a unique CRIU run id. We already do something like that when generating the criu_run_id value:

void util_init(void)
{
...
	criu_run_id = getpid();
	if (!stat("/proc/self/ns/pid", &statbuf))
		criu_run_id |= (uint64_t)statbuf.st_ino << 32;
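
In other words, the low 32 bits of criu_run_id hold getpid() of the criu process and the high 32 bits the inode of /proc/self/ns/pid, roughly (a sketch; pid_ns_inode and criu_pid are placeholder names, not real variables):

	/* Assumed layout, based on util_init() above. */
	uint64_t run_id = ((uint64_t)pid_ns_inode << 32) | (uint32_t)criu_pid;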

@adrianreber (Member Author)

@mihalicyn I am happy to use whatever makes most sense.

What is pid_ns_id? Is that basically the inode of the PID NS? Or more? It still sounds like something we need to save somewhere in the image, right?

Yes, it's not a forward-compatible change and will break restore of images which were dumped with an older CRIU.

I don't think we have to worry about this. Currently it doesn't work at all.

Let me know which ID makes the most sense and I can rework this PR. I think the important part is that it has to come from some value in the checkpoint image and not be generated during restore.

@adrianreber (Member Author)

@mihalicyn I think I understand your proposal now. The PR could be really simple, as pid_ns_id is already in the image. Let me try it out.

@adrianreber (Member Author)

With this line, all zdtm test cases also pass (besides a couple of tests that call iptables, which I did not install) if I switch to the nftables locking backend:

if (snprintf(table, n, "inet CRIU-%d", root_item->ids->pid_ns_id) < 0) {

That brings it down to a one-line change. Very good idea, @mihalicyn. Thanks.

How long can pid_ns_id be? Currently the table variable is set to 32 characters.

@adrianreber (Member Author)

@mihalicyn Tests are happy, but root_item->ids->pid_ns_id always returns 1 when running in the host PID namespace.

So that is not really a good idea, I think, as it is not really unique.

@mihalicyn (Member) commented Dec 18, 2024

Hey Adrian,

Is that basically the inode of the PID NS?

Yes, precisely.

It still sounds like something we need to save somewhere in the image, right?

We don't, as we already have it in the image anyway.

I don't think we have to worry about this. Currently it doesn't work at all.

Are we 100% sure that it doesn't work and never worked under any circumstances?

How long can pid_ns_id be? Currently the table variable is set to 32 characters.

Hmm, it's a uint32, so in string representation I guess it's around 10 chars.
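
A quick sanity check of that arithmetic (a sketch; the 32 here is the size of the table buffer mentioned above):

/* "inet CRIU-" is 10 chars, UINT32_MAX needs at most 10 decimal digits,
 * plus the terminating NUL: 10 + 10 + 1 = 21, which fits in 32 bytes. */
_Static_assert(sizeof("inet CRIU-") - 1 + 10 + 1 <= 32,
	       "CRIU nftables table name fits in a 32-byte buffer");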

Tests are happy, but root_item->ids->pid_ns_id always returns 1 when running in the host PID namespace.

That's my bad, actually; to get the pid namespace inode number you need something like:

		ns = lookup_ns_by_id(root_item->ids->pid_ns_id, &pid_ns_desc);
		if (ns) {
			/* ns->kid is the inode number of the pid namespace */
		}

But yes, I don't think that even with this change pid_ns_id would be enough. I think we still need to add a new field to inventory_entry, but my point is to make it universal for different cases like this one and name it criu_run_id or something like that. Also, we should clearly document when it's unique and for what purposes it must be used.

@mihalicyn (Member)

Also, we have the inventory_entry.dump_uptime field, which we can consume to get a certain degree of uniqueness.

@adrianreber (Member Author)

Ah, okay. So let's use the criu_run_id and store it in the inventory.

I don't think we have to worry about this. Currently it doesn't work at all.

Are we 100% sure that it doesn't work and never worked under any circumstances?

I don't know. All tests with open TCP connections just hang during restore because the network locking cannot be disabled. According to zdtm it is so broken that it doesn't work currently.

Also, we have the inventory_entry.dump_uptime field, which we can consume to get a certain degree of uniqueness.

As an additional field in the nft table name? "CRIU-%d-%" PRIx64 " ", criu_run_id, inventory_entry.dump_uptime)

Or instead of criu_run_id? Currently we collect the uptime rather late in the checkpointing process, definitely after the network locking. It seems to be used only by detect_pid_reuse(), which is only relevant during pre-dump when looking at the parent process. So it seems like we could move the uptime detection to an earlier point and then also use it in the network locking chain. During restore we would need to look at the criu_run_id and the uptime from the checkpointing run. That could work.
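
A sketch of that combined-name idea (illustrative only; PRIx64 comes from <inttypes.h>, and the variable names just stand in for whatever the patch ends up storing in the inventory):

	if (snprintf(table, n, "inet CRIU-%" PRIx64 "-%" PRIx64,
		     criu_run_id, dump_uptime) < 0) {
		pr_err("Cannot generate CRIU's nftables table name\n");
		return -1;
	}

Note that two 64-bit hex values plus the "inet CRIU-" prefix can exceed the 32-character table buffer discussed above, so that buffer would need to grow.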

@rst0git (Member) commented Dec 18, 2024

The first question I have after going through this is "How did this work before?".

It probably never did. We are not running all of the tests on a system without iptables using the nftables locking backend. Only two or four tests run with the nftables backend.

Would it be possible to add a CI workflow or modify an existing one to run all tests with the nftables backend?

Using libnftables the chain to lock the network is composed of
("CRIU-%d", real_pid). This leads to around 40 zdtm tests failing
with errors like this:

Error: No such file or directory; did you mean table 'CRIU-62' in family inet?
delete table inet CRIU-86

The reason is that as soon as a process is running in a namespace the
real PID can be anything and only the PID in the namespace is restored
correctly. Relying on the real PID does not work for the chain name.

Using the PID of the innermost namespace would lead to the chain being
called 'CRIU-1' most of the time which is also not really unique.

With this commit the chain is now named using the already existing CRIU
run ID. To be able to correctly restore the process and delete the
locking table, the CRIU run id during checkpointing is now stored in the
inventory as dump_criu_run_id.

Signed-off-by: Adrian Reber <[email protected]>
@adrianreber force-pushed the 2024-12-17-nftables-lock-name branch from 0305093 to 30e76fd on December 20, 2024 10:12
@adrianreber (Member Author)

@mihalicyn What do you think about the latest version? In my tests this works just as well as the previous version. It now uses criu_run_id as suggested.
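
Conceptually, the restore side then does something like this (a sketch of the idea, not the code in the PR; get_dump_criu_run_id() is a hypothetical accessor for the value stored in the inventory):

	/* Reuse the run id recorded at dump time (dump_criu_run_id in the
	 * inventory) instead of recomputing anything from current PIDs.
	 * PRIx64 comes from <inttypes.h>. */
	uint64_t id = get_dump_criu_run_id();	/* hypothetical helper */

	if (snprintf(table, n, "inet CRIU-%" PRIx64, id) < 0) {
		pr_err("Cannot generate CRIU's nftables table name\n");
		return -1;
	}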
