Failed RPC connection to the node 'ejabberd@xxx': timeout #4270

Open
zzzZLRzzz opened this issue Aug 20, 2024 · 10 comments

Comments

@zzzZLRzzz

Environment

  • ejabberd version: 21.07
  • OS: Linux (Debian) alpine 3.14
  • Installed from: docker-ejabberd

Errors from error.log/crash.log

No errors

Bug description

I deployed ejabberd on k8s. This time, when I updated the application, I created a new node ejabberd@bbb
and ran ejabberdctl join_cluster ejabberd@aaa.
The log printed: "Failed RPC connection to the node 'ejabberd@bbb': timeout"
I found that the application seems to restart after I run the join_cluster command,
but this time it became unhealthy and could not recover.

@licaon-kter
Contributor

21.07? why so old.. latest is 24.07 😄

@prefiks
Member

prefiks commented Aug 20, 2024

Do 'aaa' and 'bbb' resolve correctly on both nodes? And is the network set up in such a way that they can both reach each other?

@zzzZLRzzz
Author

Do 'aaa' and 'bbb' resolve correctly on both nodes? And is the network set up in such a way that they can both reach each other?

I checked with ping: the names resolve correctly and the nodes can reach each other.
But running ejabberdctl ping on 'aaa' to check the connection to 'bbb' never returns 'pong'.
I think the reason is that after the join_cluster command, 'bbb' tried to restart itself and failed?

@prefiks
Member

prefiks commented Aug 21, 2024

Yes, that is probably why this failed. My guess would be that the ports used for communication aren't accessible between the nodes. Or it could be an issue with the cookie used for authentication (to check that, you can use the debug console and execute erlang:get_cookie(). on both nodes, and see if it returns the same value).
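
A quick way to test the "ports aren't accessible" guess from inside each pod is a plain TCP connect check. The sketch below is only an illustration, not part of ejabberd: it assumes the default EPMD port 4369, and the distribution port must be filled in by you, since Erlang assigns it dynamically unless it is restricted (for example with the kernel parameters inet_dist_listen_min and inet_dist_listen_max). The function names are hypothetical.

```python
import socket

EPMD_PORT = 4369  # default port of the Erlang Port Mapper Daemon (epmd)


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_ports(host: str, ports) -> dict:
    """Map each port to whether it is reachable on the given host."""
    return {port: can_connect(host, port) for port in ports}
```

Running something like check_ports("bbb", [EPMD_PORT, dist_port]) on node aaa, and the same against "aaa" on node bbb, shows whether both directions are open; here "bbb", "aaa" and dist_port are placeholders for your own host names and distribution port. A successful ICMP ping does not prove that these TCP ports are reachable.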

@zzzZLRzzz
Author

Yes, that is probably why this failed. My guess would be that the ports used for communication aren't accessible between the nodes. Or it could be an issue with the cookie used for authentication (to check that, you can use the debug console and execute erlang:get_cookie(). on both nodes, and see if it returns the same value).

The cookies were the same. I manually added the IPs and hostnames to /etc/hosts on both aaa and bbb,
then tried ejabberdctl ping on aaa and bbb; the result was 'pong'.
So it seems they can connect with this command,
but when I then tried ejabberdctl join_cluster, the result was still a timeout.

@prefiks
Member

prefiks commented Aug 21, 2024

Can you see what mnesia:info(). returns?

@zzzZLRzzz
Author

zzzZLRzzz commented Aug 21, 2024

Can you see what mnesia:info(). returns?

=============================================
([email protected])1> mnesia:info().
---> Processes holding locks <---
---> Processes waiting for locks <---
---> Participant transactions <---
---> Coordinator transactions <---
---> Uncertain transactions <---
---> Active tables <---
sip_session    : with 0        records occupying 305      words of mem
push_session   : with 0        records occupying 5464     bytes on disc
mqtt_session   : with 0        records occupying 305      words of mem
last_activity  : with 153016   records occupying 22370422 bytes on disc
pubsub_index   : with 0        records occupying 305      words of mem
vcard_search   : with 0        records occupying 305      words of mem
oauth_client   : with 0        records occupying 305      words of mem
archive_prefs  : with 0        records occupying 5432     bytes on disc
motd_users     : with 0        records occupying 5464     bytes on disc
mqtt_sub       : with 35       records occupying 2027     words of mem
mqtt_pub       : with 0        records occupying 5464     bytes on disc
sr_user        : with 0        records occupying 305      words of mem
schema         : with 42       records occupying 6687     words of mem
roster_version : with 0        records occupying 5464     bytes on disc
pubsub_orphan  : with 0        records occupying 305      words of mem
session        : with 5678     records occupying 437771   words of mem
pubsub_last_item: with 0        records occupying 305      words of mem
offline_msg    : with 0        records occupying 5432     bytes on disc
archive_msg    : with 0        records occupying 5432     bytes on disc
route          : with 10       records occupying 539      words of mem
private_storage: with 0        records occupying 5464     bytes on disc
motd           : with 0        records occupying 5464     bytes on disc
caps_features  : with 0        records occupying 5464     bytes on disc
sr_group       : with 0        records occupying 305      words of mem
oauth_token    : with 0        records occupying 5464     bytes on disc
bytestream     : with 0        records occupying 305      words of mem
pubsub_item    : with 0        records occupying 5464     bytes on disc
muc_room       : with 0        records occupying 305      words of mem
privacy        : with 0        records occupying 5432     bytes on disc
bosh           : with 0        records occupying 305      words of mem
pubsub_state   : with 0        records occupying 139      words of mem
temporarily_blocked: with 0        records occupying 305      words of mem
muc_registered : with 0        records occupying 305      words of mem
s2s            : with 0        records occupying 305      words of mem
ejabberd_commands: with 160      records occupying 57116    words of mem
session_counter: with 0        records occupying 305      words of mem
mod_register_ip: with 0        records occupying 305      words of mem
muc_online_room: with 0        records occupying 139      words of mem
pubsub_node    : with 0        records occupying 305      words of mem
vcard          : with 0        records occupying 5432     bytes on disc
route_multicast: with 0        records occupying 305      words of mem
roster         : with 0        records occupying 5464     bytes on disc
===> System info in version "4.20.4.1", debug level = none <===
opt_disc. Directory "/home/ejabberd/database/[email protected]" is used.
use fallback at restart = false
running db nodes   = ['[email protected]','[email protected]']
stopped db nodes   = ['[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]']
master node tables = []
remote             = []
ram_copies         = [bosh,bytestream,ejabberd_commands,mod_register_ip,
                      mqtt_session,mqtt_sub,muc_online_room,pubsub_last_item,
                      route,route_multicast,s2s,session,session_counter,
                      sip_session,temporarily_blocked]
disc_copies        = [muc_registered,muc_room,oauth_client,pubsub_index,
                      pubsub_node,pubsub_orphan,pubsub_state,schema,sr_group,
                      sr_user,vcard_search]
disc_only_copies   = [archive_msg,archive_prefs,caps_features,last_activity,
                      motd,motd_users,mqtt_pub,oauth_token,offline_msg,
                      privacy,private_storage,pubsub_item,push_session,roster,
                      roster_version,vcard]
[{'[email protected]',
     disc_copies},
 {'[email protected]',
     disc_copies}] = [pubsub_node,muc_registered,pubsub_state,muc_room,
                      sr_group,pubsub_orphan,schema,sr_user,oauth_client,
                      vcard_search,pubsub_index]
[{'[email protected]',
     disc_only_copies}] = [caps_features]
[{'[email protected]',
     disc_only_copies},
 {'[email protected]',
     disc_only_copies}] = [roster,vcard,privacy,pubsub_item,oauth_token,motd,
                           private_storage,archive_msg,offline_msg,
                           roster_version,mqtt_pub,motd_users,archive_prefs,
                           last_activity,push_session]
[{'[email protected]',
     ram_copies}] = [mod_register_ip,ejabberd_commands,bosh]
[{'[email protected]',
     ram_copies},
 {'[email protected]',
     ram_copies}] = [route_multicast,muc_online_room,session_counter,s2s,
                     temporarily_blocked,bytestream,route,pubsub_last_item,
                     session,mqtt_sub,mqtt_session,sip_session]
298 transactions committed, 8 aborted, 0 restarted, 787 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
ok
======================================================================
([email protected])1> mnesia:info().
---> Processes holding locks <---
---> Processes waiting for locks <---
---> Participant transactions <---
---> Coordinator transactions <---
---> Uncertain transactions <---
---> Active tables <---
oauth_client   : with 0        records occupying 305      words of mem
oauth_token    : with 0        records occupying 5432     bytes on disc
bytestream     : with 0        records occupying 305      words of mem
mqtt_sub       : with 0        records occupying 139      words of mem
mqtt_session   : with 0        records occupying 305      words of mem
mqtt_pub       : with 0        records occupying 5432     bytes on disc
private_storage: with 0        records occupying 5432     bytes on disc
pubsub_orphan  : with 0        records occupying 305      words of mem
pubsub_item    : with 0        records occupying 5432     bytes on disc
pubsub_state   : with 0        records occupying 139      words of mem
pubsub_node    : with 0        records occupying 305      words of mem
pubsub_index   : with 0        records occupying 305      words of mem
pubsub_last_item: with 0        records occupying 305      words of mem
caps_features  : with 0        records occupying 5432     bytes on disc
sip_session    : with 0        records occupying 305      words of mem
offline_msg    : with 0        records occupying 5432     bytes on disc
last_activity  : with 0        records occupying 5432     bytes on disc
roster_version : with 0        records occupying 5432     bytes on disc
roster         : with 0        records occupying 5432     bytes on disc
push_session   : with 0        records occupying 5432     bytes on disc
bosh           : with 0        records occupying 305      words of mem
motd_users     : with 0        records occupying 5432     bytes on disc
motd           : with 0        records occupying 5432     bytes on disc
vcard_search   : with 0        records occupying 305      words of mem
vcard          : with 0        records occupying 5432     bytes on disc
muc_online_room: with 0        records occupying 139      words of mem
muc_registered : with 0        records occupying 305      words of mem
muc_room       : with 0        records occupying 305      words of mem
archive_prefs  : with 0        records occupying 5432     bytes on disc
archive_msg    : with 0        records occupying 5432     bytes on disc
privacy        : with 0        records occupying 5432     bytes on disc
sr_user        : with 0        records occupying 305      words of mem
sr_group       : with 0        records occupying 305      words of mem
mod_register_ip: with 0        records occupying 305      words of mem
temporarily_blocked: with 0        records occupying 305      words of mem
s2s            : with 0        records occupying 305      words of mem
session_counter: with 0        records occupying 305      words of mem
session        : with 0        records occupying 305      words of mem
route_multicast: with 0        records occupying 305      words of mem
route          : with 5        records occupying 412      words of mem
ejabberd_commands: with 161      records occupying 57717    words of mem
schema         : with 42       records occupying 5625     words of mem
===> System info in version "4.19.1", debug level = none <===
opt_disc. Directory "/home/ejabberd/database/[email protected]" is used.
use fallback at restart = false
running db nodes   = ['[email protected]']
stopped db nodes   = []
master node tables = []
remote             = []
ram_copies         = [bosh,bytestream,ejabberd_commands,mod_register_ip,
                      mqtt_session,mqtt_sub,muc_online_room,pubsub_last_item,
                      route,route_multicast,s2s,session,session_counter,
                      sip_session,temporarily_blocked]
disc_copies        = [muc_registered,muc_room,oauth_client,pubsub_index,
                      pubsub_node,pubsub_orphan,pubsub_state,schema,sr_group,
                      sr_user,vcard_search]
disc_only_copies   = [archive_msg,archive_prefs,caps_features,last_activity,
                      motd,motd_users,mqtt_pub,oauth_token,offline_msg,
                      privacy,private_storage,pubsub_item,push_session,roster,
                      roster_version,vcard]
[{'[email protected]',
     disc_copies}] = [schema,sr_group,sr_user,muc_room,muc_registered,
                      vcard_search,pubsub_index,pubsub_node,pubsub_state,
                      pubsub_orphan,oauth_client]
[{'[email protected]',
     disc_only_copies}] = [privacy,archive_msg,archive_prefs,vcard,motd,
                           motd_users,push_session,roster,roster_version,
                           last_activity,offline_msg,caps_features,
                           pubsub_item,private_storage,mqtt_pub,oauth_token]
[{'[email protected]',
     ram_copies}] = [ejabberd_commands,route,route_multicast,session,
                     session_counter,s2s,temporarily_blocked,mod_register_ip,
                     muc_online_room,bosh,sip_session,pubsub_last_item,
                     mqtt_session,mqtt_sub,bytestream]
48 transactions committed, 6 aborted, 0 restarted, 82 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
ok

@prefiks
Member

prefiks commented Aug 21, 2024

Could you try joining the node from the debug shell instead of ejabberdctl? Just execute ejabberd_admin:join_cluster(<<"[email protected]">>). (update the node name as needed). This should let the operation run without the timeout that all ejabberdctl commands have; since the operation needs to sync database tables between nodes, it's possible that it takes longer than that timeout.

@zzzZLRzzz
Author

Could you try joining the node from the debug shell instead of ejabberdctl? Just execute ejabberd_admin:join_cluster(<<"[email protected]">>). (update the node name as needed). This should let the operation run without the timeout that all ejabberdctl commands have; since the operation needs to sync database tables between nodes, it's possible that it takes longer than that timeout.

Still a timeout.
Does the mnesia:info() output mean that the last_activity table takes up a lot of space?
There are only about 6k XMPP connections on ejabberd, so I wonder why the 'last_activity' and 'session' tables take up so much space.
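
To put the table sizes in perspective, the "Active tables" lines of mnesia:info() can be converted to bytes. This is a rough sketch, not an ejabberd tool: it assumes a 64-bit Erlang VM, where one word is 8 bytes, and the helper name table_sizes is made up for illustration. mnesia:info() reports in-memory tables in words and disc_only_copies tables in bytes on disc.

```python
import re

WORD_SIZE = 8  # bytes per Erlang word; assumes a 64-bit VM

# Matches lines such as:
#   session        : with 5678     records occupying 437771   words of mem
LINE_RE = re.compile(
    r"^(?P<table>\w+)\s*: with (?P<records>\d+)\s+records occupying "
    r"(?P<size>\d+)\s+(?P<unit>words of mem|bytes on disc)$"
)


def table_sizes(info_text: str) -> dict:
    """Return {table: (records, size_in_bytes)} parsed from mnesia:info() text."""
    sizes = {}
    for line in info_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip headers, separators and the system info section
        size = int(match["size"])
        if match["unit"] == "words of mem":
            size *= WORD_SIZE  # convert words to bytes
        sizes[match["table"]] = (int(match["records"]), size)
    return sizes
```

Applied to the two largest tables in the output above, and under the 8-byte word assumption, this gives about 22.4 MB on disc for last_activity (153016 records) and about 3.5 MB of memory for session (5678 records).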

@Neustradamus

@zzzZLRzzz: What is your ejabberd version?
Has it been solved?
