This repository has been archived by the owner on Mar 4, 2024. It is now read-only.

[question] Understanding how to add new servers #440

Closed
mdorier opened this issue Jun 22, 2023 · 15 comments

Comments

@mdorier

mdorier commented Jun 22, 2023

I'm trying to understand the procedure to add a new server to an existing cluster. Here is what I do to initialize the cluster, currently, on all its processes:

struct raft_configuration conf = {0};
raft_configuration_init(&conf);
for(unsigned i = 0; i < cluster_size; i++) {
        char* address = ...; /* get address for member i */
        raft_id id = i + 1; /* server IDs must be non-zero */
        raft_configuration_add(&conf, id, address, RAFT_VOTER);
}
ret = raft_bootstrap(r, &conf);
raft_configuration_close(&conf);

Now if I want to add a new server and convert it into a voter, I know I need to have the leader of the current cluster call raft_add with the ID and address of the new server (which I can easily communicate to the leader), then call raft_assign to make it a voter.

However what is not clear to me is what the new server should be doing. Should I just call raft_init then raft_start without calling raft_bootstrap? Should I call raft_bootstrap with a configuration that only includes the new server? Or should the new server obtain the current list of servers in some other way and run the above code as well?

@cole-miller
Contributor

If you bootstrap using a configuration that already contains all the servers you plan to have in the cluster, you shouldn't raft_add those servers again -- their membership has already been recorded in the Raft log and committed as part of the bootstrap process. You can just raft_start those servers, and the leader (the server that called raft_bootstrap) will contact them to replicate log entries, including the one that describes the initial configuration. You should never call raft_bootstrap on more than one node in the cluster.

An alternative approach, as used by dqlite for example, is to have the first server call raft_bootstrap with a configuration that contains only itself. Then as each other server starts up, you call raft_add on the bootstrap server to have it join the cluster. But you shouldn't mix this with the previous strategy.
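
For concreteness, here's a minimal sketch of that single-server bootstrap variant, assuming the v0 raft_init(r, io, fsm, id, address) signature; the io, fsm and address values are placeholders for your own setup:

/* Sketch: the first server bootstraps a configuration containing only itself.
 * io and fsm are assumed to be a raft_io and raft_fsm set up elsewhere. */
struct raft r;
const char *address = "127.0.0.1:9001"; /* placeholder: this server's address */
int rv = raft_init(&r, &io, &fsm, 1 /* id */, address);
if (rv != 0) { /* handle error */ }

struct raft_configuration conf = {0};
raft_configuration_init(&conf);
raft_configuration_add(&conf, 1, address, RAFT_VOTER);
rv = raft_bootstrap(&r, &conf); /* writes entry 1: a one-server configuration */
raft_configuration_close(&conf);

rv = raft_start(&r); /* the server elects itself leader of a one-node cluster */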

Does that make sense? I'm happy to answer follow-up questions.

@cole-miller
Contributor

Or to keep it short...

Should I just call raft_init then raft_start without calling raft_bootstrap?

Yes, this is what you should do on non-bootstrap nodes. If (and only if) the server was not part of the bootstrap configuration, you will have to raft_add it on the leader in order for it to participate in the cluster.
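
As a sketch, the joining server's side then reduces to this (again assuming the v0 API, with io, fsm, id and address as placeholders):

/* Sketch: a server joining an existing cluster. Note there is no
 * raft_bootstrap call: the log starts empty and is filled by replication
 * from the leader once raft_add has been called for this id/address. */
struct raft r;
const char *address = "127.0.0.1:9002"; /* placeholder: this server's address */
int rv = raft_init(&r, &io, &fsm, 2 /* id */, address);
if (rv != 0) { /* handle error */ }
rv = raft_start(&r); /* starts as a follower with an empty log */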

@freeekanayaka
Contributor

If you bootstrap using a configuration that already contains all the servers you plan to have in the cluster, you shouldn't raft_add those servers again -- their membership has already been recorded in the Raft log and committed as part of the bootstrap process. You can just raft_start those servers, and the leader (the server that called raft_bootstrap) will contact them to replicate log entries, including the one that describes the initial configuration. You should never call raft_bootstrap on more than one node in the cluster.

Just a clarification: you can actually call raft_bootstrap() on more than one node. The code that @mdorier posted, where he calls raft_bootstrap() on all the servers that are part of the initial configuration, is correct.

@cole-miller
Contributor

cole-miller commented Jun 22, 2023

@freeekanayaka Thanks -- I guess I mistakenly transferred the dqlite requirement to have only one bootstrap node to raft in my head.

@mdorier
Author

mdorier commented Jun 22, 2023

I was indeed calling it on more than one node. I think it works because they will all effectively write the same entry with the exact same configuration, so there will just be no need for the leader to send the configuration over to the followers.

I think I understand better now. The pattern of starting with one server that bootstraps and adding servers one by one is what I'm trying to do. So I can just call raft_init + raft_start on the new server and raft_add + raft_assign on the leader, if I understand correctly.

@freeekanayaka
Contributor

I'm trying to understand the procedure to add a new server to an existing cluster. Here is what I do to initialize the cluster, currently, on all its processes:

struct raft_configuration conf = {0};
raft_configuration_init(&conf);
for(unsigned i = 0; i < cluster_size; i++) {
        char* address = ...; /* get address for member i */
        raft_id id = i + 1; /* server IDs must be non-zero */
        raft_configuration_add(&conf, id, address, RAFT_VOTER);
}
ret = raft_bootstrap(r, &conf);
raft_configuration_close(&conf);

This is correct.

Now if I want to add a new server and convert it into a voter, I know I need to have the leader of the current cluster call raft_add with the ID and address of the new server (which I can easily communicate to the leader), then call raft_assign to make it a voter.

This is also correct.

However what is not clear to me is what the new server should be doing. Should I just call raft_init then raft_start without calling raft_bootstrap?

Yes, precisely.

Should I call raft_bootstrap with a configuration that only includes the new server?

No, that would lead to an inconsistent cluster. What raft_bootstrap() does is simply write the first entry in the log, which contains the initial configuration. All servers that are part of the initial configuration should call raft_bootstrap() with that same configuration. If you call raft_bootstrap() on a new server and pass it a different configuration, then entry 1 of that server will differ from entry 1 on all other servers.

Or should the new server obtain the current list of servers in some other way and run the above code as well?

The new server will obtain the current list as part of log replication, since other servers will send it entry 1.

@freeekanayaka
Contributor

freeekanayaka commented Jun 22, 2023

I was indeed calling it on more than one node. I think it works because they will all effectively write the same entry with the exact same configuration, so there will just be no need for the leader to send the configuration over to the followers.

I think I understand better now. The pattern of starting with one server that bootstraps and adding servers one by one is what I'm trying to do. So I can just call raft_init + raft_start on the new server and raft_add + raft_assign on the leader, if I understand correctly.

Yes, you can do that too if you wish; it's what @cole-miller described and what dqlite does. But it's not mandatory, as long as you follow the rules above (which are the standard Raft rules from the paper).
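
A sketch of the leader-side half of that pattern, assuming the v0 raft_add/raft_assign signatures and the raft_change request struct with its data field, and assuming that only one configuration change may be pending at a time (which is why raft_assign is issued from the raft_add callback); the helper names are hypothetical:

/* Sketch: add a new server and then promote it to voter. The raft_assign
 * call is deferred to the raft_add callback, on the assumption that only
 * one configuration change may be in flight at a time. */
static struct raft_change add_req;
static struct raft_change assign_req;
static raft_id new_id; /* id of the joining server */

static void assign_cb(struct raft_change *req, int status)
{
    (void)req;
    /* status == 0: the new server is now a voter */
}

static void add_cb(struct raft_change *req, int status)
{
    struct raft *r = req->data;
    if (status != 0) {
        return; /* handle error */
    }
    assign_req.data = r;
    raft_assign(r, &assign_req, new_id, RAFT_VOTER, assign_cb);
}

/* Called on the leader once the new server's id and address are known. */
static void add_server(struct raft *r, raft_id id, const char *address)
{
    new_id = id;
    add_req.data = r;
    raft_add(r, &add_req, id, address, add_cb);
}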

@freeekanayaka
Contributor

@freeekanayaka Thanks -- I guess I mistakenly transferred the dqlite requirement to have only one bootstrap node to raft in my head.

FWIW, it's more an implementation detail than a requirement. In fact, we might also want to support the more-than-one-server-in-the-initial-configuration scenario in the future, and we wouldn't break any fundamental dqlite assumption. It's just that there hasn't been such a use case yet.

@mdorier
Author

mdorier commented Jun 22, 2023

Thanks to both of you for the clarification! Closing the issue.

@mdorier mdorier closed this as completed Jun 22, 2023
@mdorier
Author

mdorier commented Jun 28, 2023

I'm going to re-open this issue as there are still some things I don't understand. I am implementing my own raft_io backend and trying to get each function to work properly. I must not be understanding something correctly, as I'm getting a segfault in the joining process.

Following the discussion above I decided to try having just one process call raft_bootstrap, with a configuration that includes only itself, then add new processes using raft_add and raft_assign. This is what I see happening:

On the leader (initial process), at start time:

  • raft_bootstrap: calls the underlying log's bootstrap function.
  • raft_start: calls the underlying log's load function, which gives term=1, start_index=0, voted_for=0, n_entries=1, and a snapshot containing a configuration with 1 server.

The process prints the following:

[craft] [src/start.c:155] starting
[log] load: term=1 start_index=0 voted_for=0 n_entries=1, servers=0
[craft] [src/start.c:163] current_term:1 voted_for:0 start_index:0 n_entries:1
[craft] [src/start.c:168] restore snapshot with last index 0 and last term 1
[craft] [src/configuration.c:342] configuration restore from snapshot
[craft] [src/configuration.c:343] === CONFIG START ===
[craft] [src/configuration.c:348] id:1 address:ofi+tcp://192.168.5.15:34251 role:1
[craft] [src/configuration.c:350] === CONFIG END ===
[craft] [src/start.c:201] restore 1 entries starting at 0
[craft] [src/convert.c:23] old_state:0 new_state:1
[craft] [src/convert.c:38] clear follower state
[craft] [src/convert.c:23] old_state:1 new_state:2
[craft] [src/convert.c:168] self elect and convert to leader
[craft] [src/convert.c:49] clear candidate state
[craft] [src/convert.c:23] old_state:2 new_state:3

On the joining process, at start time:

  • raft_start: calls the underlying log's load function, which gives term=0, start_index=0, voted_for=0, n_entries=0, and a snapshot containing a configuration with 0 servers.

The process prints the following:

[craft] [src/start.c:155] starting
[log] load: term=0 start_index=0 voted_for=0 n_entries=0 servers=0
[craft] [src/start.c:163] current_term:0 voted_for:0 start_index:0 n_entries:0
[craft] [src/start.c:168] restore snapshot with last index 0 and last term 0
[craft] [src/configuration.c:342] configuration restore from snapshot
[craft] [src/configuration.c:343] === CONFIG START ===
[craft] [src/configuration.c:350] === CONFIG END ===
[craft] [src/start.c:201] restore 0 entries starting at 0
[craft] [src/convert.c:23] old_state:0 new_state:1

Then when the leader is informed of the new process, it calls raft_add, followed by raft_assign, and prints the following:

// raft_add call
[craft] [src/client.c:184] add server: id 2, address ofi+tcp://192.168.5.15:38785
[log] append
[craft] [src/replication.c:474] leader: written 1 entries starting at 1: status 0
[craft] [src/replication.c:1741] new commit index 1
[craft] [src/replication.c:1421] configuration at index:1 is committed.
// raft_assign
[craft] [src/client.c:228] raft_assign to id:2 the role:1
[craft] [src/replication.c:99] send 0 entries starting at 1 to server 2 (last index 1)

(the last line repeats several times, I think because it's an AppendEntries call that keeps getting retried even though the other process has crashed at this point)

On the joining process, I see the following:

[craft] [src/recv_append_entries.c:36] self:2 from:1@ofi+tcp://192.168.5.15:34251 leader_commit:1 n_entries:0 prev_log_index:1 prev_log_term:1, term:1
[craft] [src/recv.c:122] remote term 1 is higher than 0 -> bump local term
[log] set_term
Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))

The log's set_term completes fine. The backtrace shows that the process crashed in raft's code, in logTermOf. The full backtrace through raft's code is the following:

#0  0x0000ffffb31970fc in logTermOf (index=<optimized out>, l=0xaaaaba11ada0) at src/log.c:644
#1  logTermOf (l=0xaaaaba11ada0, index=<optimized out>) at src/log.c:620
#2  0x0000ffffb319c2e0 in checkLogMatchingProperty (args=0xffffad86d010, r=0xfffffae848d8) at src/replication.c:959
#3  replicationAppend (r=r@entry=0xfffffae848d8, args=args@entry=0xffffad86d010, rejected=rejected@entry=0xffffad86cf20,
    async=async@entry=0xffffad86cef3) at src/replication.c:1076
#4  0x0000ffffb3199150 in recvAppendEntries (r=r@entry=0xfffffae848d8, id=1, address=0xffffa4000ad0 "ofi+tcp://192.168.5.15:34251",
    args=args@entry=0xffffad86d010) at src/recv_append_entries.c:119
#5  0x0000ffffb3198b78 in recvMessage (message=0xffffad86cff8, r=0xfffffae848d8) at src/recv.c:27
#6  recvCb (io=<optimized out>, message=0xffffad86cff8) at src/recv.c:103

The only difference between my current code and what I had before is that now only one process calls bootstrap, so the other processes (1) are not initially part of the cluster and (2) don't have anything in their log to begin with. Since the only log function that gives information to raft is the load function, I'm guessing my mistake is in that function. Am I returning correct values for a call to load that does not find any pre-existing state (namely term=0, start_index=0, voted_for=0, n_entries=0, and a snapshot containing a configuration with 0 servers)? Or maybe I am missing a step in initializing the joining process?

Thanks for your help!

@mdorier mdorier reopened this Jun 28, 2023
@freeekanayaka
Contributor

I'm going to re-open this issue as there are still some things I don't understand. I am implementing my own raft_io backend and trying to get each function to work properly.

I'm not sure when I'll have time to look at the details of this, maybe @cole-miller or @MathieuBordere will be able to do that.

However, before spending too much time on this, may I ask why you need to implement your own raft_io? I presume you need to run this library on a non-Linux system? Or you have some exotic hardware at hand, perhaps an HPC platform of some sort.

Implementing raft_io correctly might seem easy at first sight, and I believe other people have done so, or at least attempted it. In reality, if you want to get a production-grade raft_io, that's not straightforward at all. So I'd recommend making sure there are absolutely no other alternatives.

Also, beware that I think the current raft_io API (and more generally some of the APIs around the raft object) has reached a point where we should redesign it a bit. There's a proposal for a v1 version of this raft library here: #430. Not sure if that will fly, but in that case we'll surely keep backward compatibility with v0 for a good while, although not indefinitely. It won't be a dramatic change for "normal" consumers of the current v0 API, but the impact would surely be a bit greater if you also have a custom raft_io implementation that you need to adapt too. Just my personal opinion, in case you have long-term plans for what you are doing.

@mdorier
Author

mdorier commented Jun 28, 2023

However, before spending too much time on this, may I ask why you need to implement your own raft_io? I presume you need to run this library on a non-Linux system? Or you have some exotic hardware at hand, perhaps an HPC platform of some sort.

That's pretty much the reason. We have HPC systems with specific network and storage hardware. I don't think it's that complicated; unfortunately, the precise semantics of each function aren't very well documented.

@mdorier
Author

mdorier commented Jun 28, 2023

Oh I think I got it working. The load function in the joining process should have set start_index to 1 rather than 0 (seeing what the uv implementation does). I'll continue testing and probably come back with some more questions about the next function I get wrong.

Note: as I'm doing this, I'm writing down the exact semantics of each function of the raft_io structure (I have a bunch that already work right and are tested), including expectations about ownership of memory being passed to those functions. Once I get the full implementation working, I'll clean that up and send it to you guys (though if you plan to rework the raft_io system it may be too late).
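
For reference, here is a sketch of what a custom load implementation might return for a fresh, never-bootstrapped node, assuming the v0 load signature (io, term, voted_for, snapshot, start_index, entries, n_entries); my_load is a hypothetical name, and the key detail is start_index = 1:

/* Sketch: load() on a node with no pre-existing state. Log indices are
 * 1-based, so start_index is 1 even when the log is empty; no snapshot
 * and no entries are returned. */
static int my_load(struct raft_io *io,
                   raft_term *term,
                   raft_id *voted_for,
                   struct raft_snapshot **snapshot,
                   raft_index *start_index,
                   struct raft_entry *entries[],
                   unsigned *n_entries)
{
    (void)io;
    *term = 0;
    *voted_for = 0;
    *snapshot = NULL;
    *start_index = 1;
    *entries = NULL;
    *n_entries = 0;
    return 0;
}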

@freeekanayaka
Contributor

Oh I think I got it working.

Great!

The load function in the joining process should have set start_index to 1 rather than 0 (seeing what the uv implementation does). I'll continue testing and probably come back with some more questions about the next function I get wrong.

Note: as I'm doing this, I'm writing down the exact semantics of each function of the raft_io structure (I have a bunch that already work right and are tested), including expectations about ownership of memory being passed to those functions. Once I get the full implementation working, I'll clean that up and send it to you guys (though if you plan to rework the raft_io system it may be too late).

I think that would be valuable regardless, thanks. I'm not entirely sure if and when v1 will see life.

@freeekanayaka
Contributor

However, before spending too much time on this, may I ask why you need to implement your own raft_io? I presume you need to run this library on a non-Linux system? Or you have some exotic hardware at hand, perhaps an HPC platform of some sort.

That's pretty much the reason. We have HPC systems with specific network and storage hardware. I don't think it's that complicated; unfortunately, the precise semantics of each function aren't very well documented.

It's probably not too complicated to get something working; however, it's non-trivial to have it perform well and behave 100% correctly under the harshest conditions and pathological scenarios, something that our Jepsen-based dqlite test suite exercises.

Anyway, if it's open source and you could share pointers, that would be great; we could include it in our README for folks with similar needs.
