RPC connectivity stops for good in high traffic #9594

Open
moneromooo-monero opened this issue Nov 25, 2024 · 10 comments
@moneromooo-monero
Collaborator

moneromooo-monero commented Nov 25, 2024

I've been debugging it on and off in Townforge for quite a long time, as I thought it was specific to my changes, but I can actually get it to happen in Monero reliably. Townforge has quite heavy TF-specific functional tests, which trigger it reliably, and I got Monero to trigger it reliably by simply calling an RPC over and over, with this patch:

diff --git a/tests/functional_tests/daemon_info.py b/tests/functional_tests/daemon_info.py
index 9d645330d..94ef57c5f 100755
--- a/tests/functional_tests/daemon_info.py
+++ b/tests/functional_tests/daemon_info.py
@@ -50,7 +50,8 @@ class DaemonGetInfoTest():
         print('Test hard_fork_info')
 
         daemon = Daemon()
-        res = daemon.hard_fork_info()
+        while True:
+            res = daemon.hard_fork_info()
 
         # hard_fork version should be set at height 1
         assert 'earliest_height' in res.keys()
diff --git a/tests/functional_tests/functional_tests_rpc.py b/tests/functional_tests/functional_tests_rpc.py
index e483352a4..449512339 100755
--- a/tests/functional_tests/functional_tests_rpc.py
+++ b/tests/functional_tests/functional_tests_rpc.py
@@ -52,7 +52,7 @@ WALLET_DIRECTORY = builddir + "/functional-tests-directory"
 FUNCTIONAL_TESTS_DIRECTORY = builddir + "/tests/functional_tests"
 DIFFICULTY = 10
 
-monerod_base = [builddir + "/bin/monerod", "--regtest", "--fixed-difficulty", str(DIFFICULTY), "--no-igd", "--p2p-bind-port", "monerod_p2p_port", "--rpc-bind-port", "monerod_rpc_port", "--zmq-rpc-bind-port", "monerod_zmq_port", "--zmq-pub", "monerod_zmq_pub", "--non-interactive", "--disable-dns-checkpoints", "--check-updates", "disabled", "--rpc-ssl", "disabled", "--data-dir", "monerod_data_dir", "--log-level", "1"]
+monerod_base = [builddir + "/bin/monerod", "--regtest", "--fixed-difficulty", str(DIFFICULTY), "--no-igd", "--p2p-bind-port", "monerod_p2p_port", "--rpc-bind-port", "monerod_rpc_port", "--zmq-rpc-bind-port", "monerod_zmq_port", "--zmq-pub", "monerod_zmq_pub", "--non-interactive", "--disable-dns-checkpoints", "--check-updates", "disabled", "--rpc-ssl", "disabled", "--data-dir", "monerod_data_dir", "--log-level", "3"]
 monerod_extra = [
   ["--offline"],
   ["--rpc-payment-address", "44SKxxLQw929wRF6BA9paQ1EWFshNnKhXM3qz6Mo3JGDE2YG3xyzVutMStEicxbQGRfrYvAAYxH6Fe8rnD56EaNwUiqhcwR", "--rpc-payment-difficulty", str(DIFFICULTY), "--rpc-payment-credits", "5000", "--offline"],


Note that setting log level to 3 is needed here. Running with log level 1 will not trigger it. In Townforge, log level 1 is fine. Log level 2 will trigger fairly quickly. Monero with log level 3 will trigger it pretty much at once.
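For anyone wanting to reproduce this outside the functional test harness, the same tight loop can be sketched as a standalone script. This is a rough sketch and not part of the patch above: it assumes monerod's JSON-RPC endpoint is reachable at `http://127.0.0.1:18180/json_rpc` (the port is a guess, adjust it to whatever `--rpc-bind-port` was given), and the helper names `build_request`, `call_once` and `hammer` are made up for illustration. A timeout is set on each call so that a server that has stopped answering shows up as a failure rather than a hang.

```python
import json
import time
import urllib.request

# Assumption: adjust to the daemon's --rpc-bind-port.
RPC_URL = "http://127.0.0.1:18180/json_rpc"


def build_request(method, params=None):
    """Encode a JSON-RPC 2.0 request body as bytes."""
    body = {"jsonrpc": "2.0", "id": "0", "method": method}
    if params is not None:
        body["params"] = params
    return json.dumps(body).encode("utf-8")


def call_once(url, method, timeout=5.0):
    """Send one JSON-RPC call and return the decoded response."""
    req = urllib.request.Request(
        url,
        data=build_request(method),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def hammer(call, max_calls=None):
    """Invoke call() in a loop until it raises OSError.

    Returns (successful_calls, elapsed_seconds, error).
    error is None if max_calls was reached without a failure.
    """
    start = time.monotonic()
    done = 0
    while max_calls is None or done < max_calls:
        try:
            call()
        except OSError as exc:  # URLError and socket timeouts derive from OSError
            return done, time.monotonic() - start, exc
        done += 1
    return done, time.monotonic() - start, None
```

Usage would be something like `hammer(lambda: call_once(RPC_URL, "hard_fork_info"))`, which returns the number of successful calls and the time until the first failure. The daemon still needs to be started with log level 3 for the issue to show up.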

Once triggered, it never recovers. I tried adding recovery code in Townforge, to no avail (that may be because the underlying issue is not what I vaguely expect it to be).

The symptoms are an exception in handle_accept, where a syscall returns EBADF. The socket is valid at the start of the function, and becomes invalid somewhere along the execution of handle_accept. AFAICT this is not a case of the connection being destroyed by another thread, but I'd be happy to be shown to be wrong there since it's the obvious inference.
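To make the symptom concrete, here is a small Python sketch of how a set_option-style call ends up with EBADF when the descriptor is closed between the socket being created and the option being set. It only illustrates the error class on Linux/POSIX. It makes no claim that this is what happens inside handle_accept, and `setsockopt_after_close` is a made-up name.

```python
import os
import socket


def setsockopt_after_close():
    """Close a socket's fd behind its back, then set an option on it.

    Returns the errno of the failed setsockopt call, or None if it succeeded.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        os.close(sock.fileno())  # the descriptor is now invalid, the object is not
        try:
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        except OSError as exc:
            return exc.errno
        return None
    finally:
        sock.detach()  # forget the stale fd so no second close is attempted
```

In a multithreaded process the closed descriptor number can be reused by another open or accept before the option is set, in which case the call would hit an unrelated descriptor instead of failing.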

I've spent days on this over the months, I hope someone with more networking chops can have a try at it.

Note that there have been reports of RPC connectivity going down over the years, that's probably the same thing.

@0xFFFC0000
Collaborator

I can confirm this is happening. After (briefly) testing it, this is the call stack that consumes most of the computation time:

(screenshot of the call stack)

P.S. Take this information with a grain of salt. I will profile / debug this tomorrow.

@moneromooo-monero
Collaborator Author

To be clear, the issue isn't performance degradation due to heavy logging, it is the server stopping accepting connections after this:

ERROR net contrib/epee/include/net/abstract_tcp_server2.inl:1528 Exception in boosted_tcp_server<t_protocol_handler>::handle_accept: set_option: Bad file descriptor

Note that if you trace around, the EBADF might come from another function, set_option is just the most likely to get whacked.
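Since the failing call can vary, a quick way to check a daemon log for this is to match on the handle_accept part and pull out whichever call is named. A minimal sketch, with the pattern modelled only on the one log line quoted above (other variants are assumed to follow the same shape) and `find_accept_errors` being a made-up helper name:

```python
import re

# Pattern modelled on the single log line quoted above; other variants are assumed.
ACCEPT_ERROR = re.compile(
    r"Exception in boosted_tcp_server<[^>]*>::handle_accept: "
    r"(?P<call>\w+): (?P<reason>.+)$"
)


def find_accept_errors(lines):
    """Yield (call, reason) for each handle_accept exception found in lines."""
    for line in lines:
        m = ACCEPT_ERROR.search(line.rstrip("\n"))
        if m:
            yield m.group("call"), m.group("reason")
```

It can be fed an open log file directly, e.g. `list(find_accept_errors(open(path)))`, where `path` is wherever the daemon writes its log.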

@0xFFFC0000
Collaborator

In that case, I left it running for about 10 minutes. But I don't have any

Exception in boosted_tcp_server<t_protocol_handler>::handle_accept: set_option: Bad file descriptor

in my logs.

How long does it usually take for the exception to show up?

@moneromooo-monero
Collaborator Author

In three attempts, about... 5 seconds, 5 seconds, 20 seconds maybe. After waiting for servers to be running. This is on master from a1dc85c.

@moneromooo-monero
Collaborator Author

Running:

./tests/functional_tests/functional_tests_rpc.py /usr/bin/python tests/functional_tests/ build/Linux/master/release/ daemon_info

@0xFFFC0000
Collaborator

I am hitting the infinite while loop correctly, but I haven't been able to reproduce the Bad file descriptor error after 15 minutes of running.

I will update you if anything comes up, and will try it on a bare metal machine too. I am doing it on a VM right now.

@moneromooo-monero
Collaborator Author

I'm running on an old Fedora VM. I'll try setting up a more recent one later, it might be a dep issue if you can't get it to happen.

@0xFFFC0000
Collaborator

I tried on a vm:

 $ >> cat /etc/os-release 
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"

@moneromooo-monero
Collaborator Author

Also happens pretty much instantly on Fedora 41, GCC 14.2.1.

@moneromooo-monero
Collaborator Author

Also Debian 12, GCC 12.2.0.

All of this is running in Qubes OS, so there might be something weird to do with Xen I guess, though it does seem a bit unlikely.
