RPC connectivity stops for good in high traffic #9594

moneromooo-monero · 2024-11-25T14:24:51Z

I've been debugging it on and off in Townforge for quite a long time, as I thought it was specific to my changes, but I can actually get it to happen in Monero reliably. Townforge has quite heavy TF-specific functional tests, which trigger it reliably, and I got Monero to trigger it reliably by simply calling an RPC over and over, with this patch:

diff --git a/tests/functional_tests/daemon_info.py b/tests/functional_tests/daemon_info.py
index 9d645330d..94ef57c5f 100755
--- a/tests/functional_tests/daemon_info.py
+++ b/tests/functional_tests/daemon_info.py
@@ -50,7 +50,8 @@ class DaemonGetInfoTest():
         print('Test hard_fork_info')
 
         daemon = Daemon()
-        res = daemon.hard_fork_info()
+        while True:
+            res = daemon.hard_fork_info()
 
         # hard_fork version should be set at height 1
         assert 'earliest_height' in res.keys()
diff --git a/tests/functional_tests/functional_tests_rpc.py b/tests/functional_tests/functional_tests_rpc.py
index e483352a4..449512339 100755
--- a/tests/functional_tests/functional_tests_rpc.py
+++ b/tests/functional_tests/functional_tests_rpc.py
@@ -52,7 +52,7 @@ WALLET_DIRECTORY = builddir + "/functional-tests-directory"
 FUNCTIONAL_TESTS_DIRECTORY = builddir + "/tests/functional_tests"
 DIFFICULTY = 10
 
-monerod_base = [builddir + "/bin/monerod", "--regtest", "--fixed-difficulty", str(DIFFICULTY), "--no-igd", "--p2p-bind-port", "monerod_p2p_port", "--rpc-bind-port", "monerod_rpc_port", "--zmq-rpc-bind-port", "monerod_zmq_port", "--zmq-pub", "monerod_zmq_pub", "--non-interactive", "--disable-dns-checkpoints", "--check-updates", "disabled", "--rpc-ssl", "disabled", "--data-dir", "monerod_data_dir", "--log-level", "1"]
+monerod_base = [builddir + "/bin/monerod", "--regtest", "--fixed-difficulty", str(DIFFICULTY), "--no-igd", "--p2p-bind-port", "monerod_p2p_port", "--rpc-bind-port", "monerod_rpc_port", "--zmq-rpc-bind-port", "monerod_zmq_port", "--zmq-pub", "monerod_zmq_pub", "--non-interactive", "--disable-dns-checkpoints", "--check-updates", "disabled", "--rpc-ssl", "disabled", "--data-dir", "monerod_data_dir", "--log-level", "3"]
 monerod_extra = [
   ["--offline"],
   ["--rpc-payment-address", "44SKxxLQw929wRF6BA9paQ1EWFshNnKhXM3qz6Mo3JGDE2YG3xyzVutMStEicxbQGRfrYvAAYxH6Fe8rnD56EaNwUiqhcwR", "--rpc-payment-difficulty", str(DIFFICULTY), "--rpc-payment-credits", "5000", "--offline"],

Note that setting log level to 3 is needed here. Running with log level 1 will not trigger it. In Townforge, log level 1 is fine. Log level 2 will trigger it fairly quickly. Monero with log level 3 will trigger it pretty much at once.

Once triggered, it never recovers. I tried adding recovery code in Townforge, to no avail (that may be because the underlying issue is not what I vaguely expect it to be).

The symptom is an exception in handle_accept, where a syscall returns EBADF. The socket is valid at the start of the function, and becomes invalid somewhere along the execution of handle_accept. AFAICT this is not a case of the connection being destroyed by another thread, but I'd be happy to be shown to be wrong there since it's the obvious inference.
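For readers unfamiliar with EBADF, here is a minimal Python sketch of the generic mechanism: a descriptor that was valid a moment ago is closed by some other code path, and the next socket option call on it fails with "Bad file descriptor". This only illustrates the error class on a POSIX system. It is not a claim about what actually closes the socket in handle_accept, and the helper name `setsockopt_after_close` is made up for this example.

```python
import errno
import os
import socket


def setsockopt_after_close():
    """Return the errno raised when setting an option on a closed descriptor."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    fd = sock.fileno()  # the descriptor is valid at this point
    # Close the descriptor behind the socket object's back, standing in for
    # "something else" invalidating it mid-function (an assumption).
    os.close(fd)
    try:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    except OSError as e:
        return e.errno
    finally:
        # Detach so the socket object does not try to close the stale fd again.
        sock.detach()
    return None
```

Printing `errno.errorcode[setsockopt_after_close()]` shows the symbolic name of the error that the failing call reported.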

I've spent days on this over the months, I hope someone with more networking chops can have a try at it.

Note that there have been reports of RPC connectivity going down over the years; that's probably the same thing.

0xFFFC0000 · 2024-11-25T15:07:34Z

I can confirm this is happening. After (briefly) testing it, this is the call stack which consumes most of the computation time:

P.S. Take this information with a grain of salt. I will profile / debug this tomorrow.

moneromooo-monero · 2024-11-25T15:13:00Z

To be clear, the issue isn't performance degradation due to heavy logging, it is the server stopping accepting connections after this:

ERROR net contrib/epee/include/net/abstract_tcp_server2.inl:1528 Exception in boosted_tcp_server<t_protocol_handler>::handle_accept: set_option: Bad file descriptor

Note that if you trace around, the EBADF might come from another function, set_option is just the most likely to get whacked.
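Since the EBADF may be reported by a call other than set_option, a log search should key on handle_accept and the error text rather than on the exact function name. A minimal sketch of such a filter follows; the names `ACCEPT_EBADF` and `find_accept_failures` are hypothetical, and the pattern is derived only from the single log line quoted above.

```python
import re

# Matches the handle_accept failure regardless of which call reported EBADF.
# Built from the one log line quoted in this thread (an assumption about the
# general shape of the message).
ACCEPT_EBADF = re.compile(
    r"Exception in boosted_tcp_server<[^>]*>::handle_accept: .*Bad file descriptor"
)


def find_accept_failures(lines):
    """Return the log lines that report EBADF from handle_accept."""
    return [line.rstrip("\n") for line in lines if ACCEPT_EBADF.search(line)]
```

Passing an open daemon log file to `find_accept_failures` returns the matching lines, or an empty list if the failure never happened in that run.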

0xFFFC0000 · 2024-11-25T15:33:16Z

In that case, I left it running for about 10 minutes. But I don't have any

Exception in boosted_tcp_server<t_protocol_handler>::handle_accept: set_option: Bad file descriptor

in my logs.

How long does it usually take for the exception to show up?

moneromooo-monero · 2024-11-25T15:44:08Z

In three attempts, about... 5 seconds, 5 seconds, 20 seconds maybe. After waiting for servers to be running. This is on master from a1dc85c.

moneromooo-monero · 2024-11-25T15:45:04Z

Running:

./tests/functional_tests/functional_tests_rpc.py /usr/bin/python tests/functional_tests/ build/Linux/master/release/ daemon_info
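For anyone who wants to time the failure outside the functional test harness, here is a rough standalone sketch of the same idea: call hard_fork_info in a loop and report how many calls succeeded before the first connection error. The URL and port are assumptions and need to be adjusted to wherever the daemon's RPC is bound. The helper names are made up for this example, and it talks to the `/json_rpc` endpoint directly rather than through the test framework's `Daemon` class.

```python
import json
import urllib.request


def hard_fork_info(url="http://127.0.0.1:18180/json_rpc", timeout=5.0):
    """POST a hard_fork_info JSON-RPC request and return the decoded reply.

    The default URL is an assumption; point it at the daemon's RPC port.
    """
    body = json.dumps(
        {"jsonrpc": "2.0", "id": "0", "method": "hard_fork_info"}
    ).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def calls_until_failure(call, max_calls):
    """Invoke call() up to max_calls times; return (successes, first_error)."""
    for n in range(max_calls):
        try:
            call()
        except OSError as e:  # URLError and socket timeouts are OSError subclasses
            return n, e
    return max_calls, None
```

Running `calls_until_failure(hard_fork_info, 1_000_000)` against a daemon started with log level 3 should give a rough count of successful calls before connectivity stops, if the issue reproduces on that machine.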

0xFFFC0000 · 2024-11-25T15:51:52Z

I am hitting the infinite while loop correctly, but I haven't been able to reproduce the Bad file descriptor error after 15 minutes of running.

I will update you if anything comes up, and will try it on a bare metal machine too. I am doing it on a VM right now.

moneromooo-monero · 2024-11-25T15:56:11Z

I'm running on an old Fedora VM. I'll try setting up a more recent one later, it might be a dep issue if you can't get it to happen.

0xFFFC0000 · 2024-11-25T15:58:29Z

I tried on a VM:

 $ >> cat /etc/os-release 
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"

moneromooo-monero · 2024-11-26T16:08:50Z

Also happens pretty much instantly on Fedora 41, GCC 14.2.1.

moneromooo-monero · 2024-11-27T11:43:50Z

Also Debian 12, GCC 12.2.0.

All of this is running in Qubes OS, so there might be something weird to do with Xen I guess, though it does seem a bit unlikely.

0xFFFC0000 added important discussion labels Nov 25, 2024
