Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persist userdata sysctl changes to /etc/sysctl.d #4314

Open
aetimmes opened this issue Nov 26, 2024 · 8 comments
Open

Persist userdata sysctl changes to /etc/sysctl.d #4314

aetimmes opened this issue Nov 26, 2024 · 8 comments
Assignees
Labels
status/needs-triage Pending triage or re-evaluation type/enhancement New feature or request

Comments

@aetimmes
Copy link

aetimmes commented Nov 26, 2024

What I'd like:

Persist userdata-provided sysctls to /etc/sysctl.d in a format where they can survive through systemd-sysctl.service restarts

Background

Currently, passing in sysctls via userdata results in the apiclient running sysctl commands directly.

We pass in sysctls into our Bottlerocket AMI via userdata, but also run Cilium, which has a sysctl-fixup container that creates a config file (/etc/sysctl.d/99-zzz-override_cilium.conf by default) and restarts the host sysctld systemd unit file. When this happens, the userdata-provided sysctls are clobbered.

Bootstrap commands/containers explicitly don't have the capability to modify config files in /etc, so there's no good solution to this issue currently.

Any alternatives you've considered:

  • Admittedly none
@aetimmes aetimmes added status/needs-triage Pending triage or re-evaluation type/enhancement New feature or request labels Nov 26, 2024
@koooosh
Copy link
Contributor

koooosh commented Nov 26, 2024

Hey @aetimmes , thanks for cutting this issue.

Currently, passing in sysctls via userdata results in the apiclient running sysctl commands directly.

Can you please provide an example of how you are doing this currently?

Are you using the kernel-sysctl setting?

@aetimmes
Copy link
Author

Yes, we're setting this option in the CRD for our Karpenter NodeClass:

  userData: |
   ...
        [settings.kernel.sysctl]
        "user.max_user_namespaces" = "16384"

@ytsssun
Copy link
Contributor

ytsssun commented Nov 27, 2024

I was able to reproduce this.

Steps:

  1. Connect to a running Bottlerocket Node.
  2. Set api via:
apiclient set --json '{"kernel": {"sysctl": {"user.max_user_namespaces": "16384"}}}'
  1. Verify that the sysctl parameter gets set:
{
  "settings": {
    "kernel": {
      "sysctl": {
        "user.max_user_namespaces": "16384"
      }
    }
  }
}
  1. The value is applied
bash-5.1# cat /proc/sys/user/max_user_namespaces
16384
  1. Restart the systemd-sysctl.service.
bash-5.1# systemctl restart systemd-sysctl.service
bash-5.1# systemctl status systemd-sysctl.service
● systemd-sysctl.service - Apply Kernel Variables
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/systemd-sysctl.service; static)
    Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
             └─00-aws-config.conf
     Active: active (exited) since Wed 2024-11-27 00:45:53 UTC; 10s ago
       Docs: man:systemd-sysctl.service(8)
             man:sysctl.d(5)
    Process: 66594 ExecStart=/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/systemd-sysctl (code=exited, status=0/SUCCESS)
   Main PID: 66594 (code=exited, status=0/SUCCESS)
        CPU: 5ms

Nov 27 00:45:53 ip-192-168-134-106.us-west-2.compute.internal systemd[1]: Starting Apply Kernel Variables...
Nov 27 00:45:53 ip-192-168-134-106.us-west-2.compute.internal systemd[1]: Finished Apply Kernel Variables.
  1. The user.max_user_namespaces kernel parameter gets reset to 0.
bash-5.1# cat /proc/sys/user/max_user_namespaces
0
  1. And yet the settings API value persists.
bash-5.1# apiclient get settings.kernel.sysctl
{
  "settings": {
    "kernel": {
      "sysctl": {
        "user.max_user_namespaces": "16384"
      }
    }
  }
}

Interesting thing is that, I found an exception where if I set the kernel parameters under vm path like vm.nr_overcommit_hugepages, the kernel parameter persists.

Note that the setting settings.kernel.sysctl.user.max_user_namespaces set via apiclient or userdata will be written to the rootfs at "/proc/sys/user/max_user_namespaces", so it does persist on the file system. The behavior that the "user.max_user_namespaces" gets reset to 0 and the "vm.nr_overcommit_hugepages" survives the restart makes me think it is some other special handling the sysctld service may have that treats those parameters differently. We would need to take a further look at this.

@aetimmes
Copy link
Author

aetimmes commented Nov 27, 2024

That's because /aarch64-bottlerocket-linux-gnu/sys-root/usr/lib/sysctl.d/80-release.conf contains

# Disable User Namespaces, as it opens up a large attack surface to unprivileged users.
user.max_user_namespaces = 0

so when sysctld restarts it happily clobbers the value for user.max_user_namespaces set post-boot by apiclient (along with all the other values set in config files in that directory, which can be found by running /aarch64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/systemd-sysctl --cat-config).

On the other hand, no file in that same directory modifies vm.nr_overcommit_hugepages, so when sysctld restarts, that value doesn't get clobbered.

so it does persist on the file system

My pedantic side requires me to say that procfs isn't disk and as such is erased when the machine loses power, so I'd quibble with your usage of "persist". 😃 But you're correct in that apiclient does successfully write the sysctl to procfs when it runs (If I recall correctly, I believe it yeets the value directly into memory by doing echo $VALUE > /proc/sys/$KEY).

@ytsssun
Copy link
Contributor

ytsssun commented Nov 27, 2024

That's because /aarch64-bottlerocket-linux-gnu/sys-root/usr/lib/sysctl.d/80-release.conf contains

# Disable User Namespaces, as it opens up a large attack surface to unprivileged users.
user.max_user_namespaces = 0

Thanks for the nice catch. It makes perfect sense to me. With that conf file, it indeed resets the user.max_user_namespaces value upon restart of the sysctld service.

And you are absolutely right about procfs. It is a virtual filesystem that provides an interface to kernel data structures, it's not actually stored on disk. This is indeed something we should fix so that the restart of the sysctld would not discard the kernel parameters we set via settings API.

In Bottlerocket, we use corndog sysctl to apply the kernel parameters. We have mechanism here to detect Settings change and re-trigger corndog, which is how the sysctl values can be dynamically modified via settings API. But that does not cover the case you encountered.

To solve this issue, we can try to add a drop-in for systemd-sysctl.service which would always trigger corndog sysctl after restart. I did below test to verify this.

  1. Add below drop-in to /etc/systemd/system/systemd-sysctl.service.d/10-corndog.conf
[Service]
# Run corndog after systemd-sysctl to re-apply API-configured sysctl settings
ExecStartPost=/usr/bin/corndog sysctl
  1. Reload systemd: systemctl daemon-reload.
  2. Restart sysctld service: systemctl restart systemd-sysctl.service.
  3. Now the user.max_user_namespaces is preserved by API client. Note that, by doing this, we would make the kernel parameter set via Settings API the highest precedence. I have started a thread with the team to finalize on the approach, but I do believe we could do something like this.

@aetimmes
Copy link
Author

Is corndog sysctl used in any context where sysctls are not managed by systemd-sysctl? If not, then it may be cleaner to have corndog manage a file in /etc/sysctl.d, so that when the API client is called:

  • corndog appends the new sysctl directive to a file (/etc/sysctl.d/99-bottlerocket-userdata.conf?)
    • prepending a - to the provided directive if it doesn't already exist should maintain the existing "don't fail" behavior described (here)[https://github.com/bottlerocket-os/bottlerocket-core-kit/blob/develop/sources/api/corndog/src/main.rs#L116]
  • after making the change, trigger systemctl restart systemd-sysctl to apply the changes

This way we're using a unified system-provided interface to manage these sysctls.

Looking at the existing config files from one of my bottlerocket instances, the highest-priority system-provided conf file I can find is 90 (from /aarch64-bottlerocket-linux-gnu/sys-root/usr/lib/sysctl.d/90-kubelet.conf and /etc/sysctl.d/90-primary_interface.conf). Any priority between 91-99 should work, but it might also be nice to be able to configure the priority of the corndog sysctl conf file to allow for other services to take precedence if needed.

@ytsssun
Copy link
Contributor

ytsssun commented Dec 2, 2024

Hi @aetimmes, yes that does sound like a cleaner alternative to achieve this. For corndog sysctl usage, we just need to be able to update the sysctl values whenever we set the settings at settings.kernel.sysctl (either via userdata or API Client). Our existing implementation is to write to /proc/sys directly. But we should be able to switch to the sysctl.d drop-in approach by writing the kernel.sysctl settings to the /etc/sysctl.d/99-corndog.conf and run /usr/lib/systemd/systemd-sysctl /etc/sysctl.d/99-corndog.conf to apply the conf file.

We need to basically rewrite set_sysctls function to reflect that. Would you be interested in contributing the fix yourself? It seems like a manageable change that would be great for a first contribution. I'd be happy to provide guidance for building and testing if needed.
If you don't have the time or prefer not to take this on, no worries at all - I'd be glad to work on it myself. Just let me know what you think!

@mikn
Copy link

mikn commented Dec 17, 2024

Read this just out of interest - we will most likely encounter this in the next few days as we also run Cilium. We will do the system unit drop-in to move past it for now.

@ytsssun ytsssun self-assigned this Dec 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status/needs-triage Pending triage or re-evaluation type/enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants