net: remember the name of the lock chain (nftables) #2550

Open · wants to merge 1 commit into criu-dev

Conversation

@adrianreber (Member)

Using libnftables, the name of the chain used to lock the network is composed from ("CRIU-%d", real_pid). This leads to around 40 zdtm tests failing with errors like this:

Error: No such file or directory; did you mean table 'CRIU-62' in family inet?
delete table inet CRIU-86

The reason is that as soon as a process is running in a namespace, the real PID can be anything and only the PID in the namespace is restored correctly. Relying on the real PID does not work for the chain name.

Using the PID of the innermost namespace would lead to the chain being called 'CRIU-1' most of the time, which is also not really unique.

The uniqueness of the name was always problematic. With this change, all tests that rely on network locking work again when the nftables backend is used.
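
To illustrate the failure mode, here is a minimal standalone sketch (the PIDs 62 and 86 are taken from the error output above; only the "inet CRIU-%d" pattern mirrors criu/netfilter.c, everything else is made up for illustration):

#include <stdio.h>

/* Illustration only: the table name is derived from the real PID, which
 * differs between dump and restore for a process living in a PID namespace. */
static void make_table_name(char *buf, size_t n, int real_pid)
{
	snprintf(buf, n, "inet CRIU-%d", real_pid);
}

int main(void)
{
	char dump_name[32], restore_name[32];

	make_table_name(dump_name, sizeof(dump_name), 62);       /* real PID at dump time */
	make_table_name(restore_name, sizeof(restore_name), 86); /* real PID at restore time */

	/* dump created "inet CRIU-62", but restore tries to delete "inet CRIU-86" */
	printf("dump created:    %s\n", dump_name);
	printf("restore deletes: %s\n", restore_name);
	return 0;
}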

@adrianreber force-pushed the 2024-12-17-nftables-lock-name branch from c483710 to 0305093 on December 17, 2024 13:46
@avagin requested a review from mihalicyn on December 17, 2024 16:25
@mihalicyn (Member)

Hi Adrian!

Can you tell me how and under which circumstances you caught this issue?
It is not caught by our GitHub CI tests, right?

As far as I understand, the idea of your fix is to keep the nftables table name in the inventory image file instead of dynamically recalculating it on restore (using root_item->pid->real). Am I right?

The first question I have after going through this is "How did this work before?".

P.S. I'll take a closer look at this. I haven't spent enough time yet to fully understand what's going on there.

@adrianreber (Member Author)

The first question I have after going through this is "How did this work before?".

It probably never did. We are not running all of the tests on a system without iptables using the nftables locking backend. Only two or four tests run with the nftables backend.

@adrianreber (Member Author)

Can you tell me how and under which circumstances you caught this issue?

I am trying to switch the default locking backend in Fedora and CentOS >= 10 from iptables to nftables because iptables is no longer installed by default.

As far as I understand, the idea of your fix is to keep the nftables table name in the inventory image file instead of dynamically recalculating it on restore (using root_item->pid->real). Am I right?

Yes. The table name makes sense if locking and unlocking happen in the same CRIU run, but across CRIU runs the existing approach does not work.

@mihalicyn (Member)

Ah, thanks for the clarifications!

I wonder if we can do something like this:

$ git diff
diff --git a/criu/netfilter.c b/criu/netfilter.c
index 9e78dc4b0..c558f9bf1 100644
--- a/criu/netfilter.c
+++ b/criu/netfilter.c
@@ -299,7 +299,7 @@ int nftables_lock_connection(struct inet_sk_desc *sk)
 
 int nftables_get_table(char *table, int n)
 {
-       if (snprintf(table, n, "inet CRIU-%d", root_item->pid->real) < 0) {
+       if (snprintf(table, n, "inet CRIU-%d", root_item->ids->pid_ns_id) < 0) {
                pr_err("Cannot generate CRIU's nftables table name\n");
                return -1;
        }

Yes, it's not a forward-compatible change and will break restore of images that were dumped with an older CRIU. In this form it only works for experimental purposes (and we have to check for root_item->ids->has_pid_ns_id too). But I'm curious if it helps.
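
A rough sketch of what that guarded variant could look like (not the actual patch; it only adds the has_pid_ns_id check to the function from the diff above):

int nftables_get_table(char *table, int n)
{
	/* Sketch only: refuse to build a name if the image carries no pid_ns_id. */
	if (!root_item->ids || !root_item->ids->has_pid_ns_id) {
		pr_err("No pid_ns_id available for CRIU's nftables table name\n");
		return -1;
	}

	if (snprintf(table, n, "inet CRIU-%u", root_item->ids->pid_ns_id) < 0) {
		pr_err("Cannot generate CRIU's nftables table name\n");
		return -1;
	}

	return 0;
}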

@mihalicyn (Member) commented Dec 18, 2024

My idea is that instead of introducing a new field nft_lock_table in the inventory_entry just for a single purpose, we can use inventory_entry->root_ids->pid_ns_id or root_item->ids->pid_ns_id as a source of a unique CRIU run id. We already do something like that when generating the criu_run_id value:

void util_init(void)
{
...
	criu_run_id = getpid();
	if (!stat("/proc/self/ns/pid", &statbuf))
		criu_run_id |= (uint64_t)statbuf.st_ino << 32;
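
In other words, the low 32 bits of criu_run_id hold getpid() of the criu process and the high 32 bits the inode of /proc/self/ns/pid, roughly (a sketch; pid_ns_inode and criu_pid are placeholder names, not real variables):

	/* Assumed layout, based on util_init() above. */
	uint64_t run_id = ((uint64_t)pid_ns_inode << 32) | (uint32_t)criu_pid;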

@adrianreber (Member Author)

@mihalicyn I am happy to use whatever makes most sense.

What is pid_ns_id? Is that basically the inode of the PID NS? Or more? It still sounds like something we need to save somewhere in the image, right?

Yes, it's not a forward-compatible change and will break restore of images which were dumped with an older CRIU.

I don't think we have to worry about this. Currently it doesn't work at all.

Let me know which ID makes the most sense and I can rework this PR. I think the important part is that it has to come from some value in the checkpoint image and not be generated during restore.

@adrianreber (Member Author)

@mihalicyn I think I understand your proposal now. The PR could be really simple, as pid_ns_id is already in the image. Let me try it out.

@adrianreber (Member Author)

With this line, all zdtm test cases also pass (besides a couple of tests that call iptables, which I did not install) if I switch to the nftables locking backend:

if (snprintf(table, n, "inet CRIU-%d", root_item->ids->pid_ns_id) < 0) {

That brings it down to a one-line change. Very good idea, @mihalicyn. Thanks.

How long can pid_ns_id be? Currently the table variable is set to 32 characters.

@adrianreber (Member Author)

@mihalicyn Tests are happy, but root_item->ids->pid_ns_id always returns 1 when running in the host PID namespace.

So that is not really a good idea, I think, as it is not really unique.

@mihalicyn (Member) commented Dec 18, 2024

Hey Adrian,

Is that basically the inode of the PID NS?

Yes, precisely.

It still sounds like something we need to save somewhere in the image, right?

We don't, as we already have it in the image anyway.

I don't think we have to worry about this. Currently it doesn't work at all.

Are we 100% sure that it doesn't work and never worked under any circumstances?

How long can pid_ns_id be? Currently the table variable is set to 32 characters.

Hmm, it's a uint32, so in string representation I guess it's around 10 chars.
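
A quick sanity check of that arithmetic (a sketch; the 32 here is the size of the table buffer mentioned above):

/* "inet CRIU-" is 10 chars, UINT32_MAX needs at most 10 decimal digits,
 * plus the terminating NUL: 10 + 10 + 1 = 21, which fits in 32 bytes. */
_Static_assert(sizeof("inet CRIU-") - 1 + 10 + 1 <= 32,
	       "CRIU nftables table name fits in a 32-byte buffer");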

Tests are happy, but root_item->ids->pid_ns_id always returns 1 when running in the host PID namespace.

That's my bad, actually; to get the pid namespace inode number you need something like:

		ns = lookup_ns_by_id(root_item->ids->pid_ns_id, &pid_ns_desc);
		if (ns) {
			/* ns->kid is the inode number of the pid namespace */
		}

But yes, I don't think that even with this change pid_ns_id would be enough. I think we still need to add a new field to inventory_entry, but my point is to make it universal for different cases like this one and name it criu_run_id or something like that. Also, we should clearly document when it's unique and for what purposes it must be used.

@mihalicyn (Member)

Also, we have the inventory_entry.dump_uptime field, which we can consume to get a certain degree of uniqueness.

@adrianreber (Member Author)

Ah, okay. So let's use the criu_run_id and store it in the inventory.

I don't think we have to worry about this. Currently it doesn't work at all.

Are we 100% sure that it doesn't work and never worked under any circumstances?

I don't know. All tests with open TCP connections just hang during restore because the network locking cannot be disabled. According to zdtm it is so broken that it doesn't work currently.

Also, we have the inventory_entry.dump_uptime field, which we can consume to get a certain degree of uniqueness.

As an additional field in the nft table name? "CRIU-%d-%" PRIx64 " ", criu_run_id, inventory_entry.dump_uptime)

Or instead of criu_run_id? Currently we collect the uptime rather late in the checkpointing process, definitely after the network locking. It seems to be used only by detect_pid_reuse(), which is only relevant during pre-dump when looking at the parent process. So it seems like we could move the uptime detection to an earlier point and then also use it in the network locking chain. During restore we would need to look at the criu_run_id and the uptime from the checkpointing run. That could work.
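
A sketch of that combined-name idea (illustrative only; PRIx64 comes from <inttypes.h>, and the variable names just stand in for whatever the patch ends up storing in the inventory):

	if (snprintf(table, n, "inet CRIU-%" PRIx64 "-%" PRIx64,
		     criu_run_id, dump_uptime) < 0) {
		pr_err("Cannot generate CRIU's nftables table name\n");
		return -1;
	}

Note that two 64-bit hex values plus the "inet CRIU-" prefix can exceed the 32-character table buffer discussed above, so that buffer would need to grow.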

@rst0git (Member) commented Dec 18, 2024

The first question I have after going through this is "How did this work before?".

It probably never did. We are not running all of the tests on a system without iptables using the nftables locking backend. Only two or four tests run with the nftables backend.

Would it be possible to add a CI workflow or modify an existing one to run all tests with the nftables backend?

Using libnftables the chain to lock the network is composed of
("CRIU-%d", real_pid). This leads to around 40 zdtm tests failing
with errors like this:

Error: No such file or directory; did you mean table 'CRIU-62' in family inet?
delete table inet CRIU-86

The reason is that as soon as a process is running in a namespace the
real PID can be anything and only the PID in the namespace is restored
correctly. Relying on the real PID does not work for the chain name.

Using the PID of the innermost namespace would lead to the chain being
called 'CRIU-1' most of the time which is also not really unique.

With this commit the chain is now named using the already existing CRIU
run ID. To be able to correctly restore the process and delete the
locking table, the CRIU run id during checkpointing is now stored in the
inventory as dump_criu_run_id.

Signed-off-by: Adrian Reber <[email protected]>
@adrianreber force-pushed the 2024-12-17-nftables-lock-name branch from 0305093 to 30e76fd on December 20, 2024 10:12
@adrianreber (Member Author)

@mihalicyn What do you think about the latest version? In my tests this works just as well as the previous version. It now uses criu_run_id as suggested.
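
Conceptually, the restore side then does something like this (a sketch of the idea, not the code in the PR; get_dump_criu_run_id() is a hypothetical accessor for the value stored in the inventory):

	/* Reuse the run id recorded at dump time (dump_criu_run_id in the
	 * inventory) instead of recomputing anything from current PIDs.
	 * PRIx64 comes from <inttypes.h>. */
	uint64_t id = get_dump_criu_run_id();	/* hypothetical helper */

	if (snprintf(table, n, "inet CRIU-%" PRIx64, id) < 0) {
		pr_err("Cannot generate CRIU's nftables table name\n");
		return -1;
	}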
