3.11vnodesupport #623

zmarois · 2017-11-08T20:17:28Z

Fixes #618

I believe this adds support for vnodes to the 3.11 branch. ~~I still have some integration testing to do~~, but I think I have unit tests covering the new features.

Should I also submit a PR against 3.x? How about master?

zmarois · 2017-11-08T20:27:07Z

A few overarching changes required:

* Make startup not require a token. When using vnodes, we don't need to generate a token.
* Separate the notion of the initial token from the backup identifier for the node. When using vnodes, we still need a unique identifier for the backup, but a list of all 256 initial tokens seems too long ~~, and we still need to set initial token at startup when loading from a backup, so we can't overload token as backup id because both are used for vnodes.~~
* Save the instance's tokens in the snapshot in order to start the node loading it with the initial_token that was used when taking the snapshot, step 5 in this overview (Outdated because I stripped setting tokens from backup state)
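
As a rough illustration of the first point above, the token-related `cassandra.yaml` entries could be produced along these lines. This is only a sketch, not the code in this PR: the class name, method name, and parameters are hypothetical, and it assumes only that `initial_token` is written when `num_tokens` is 1 and left out otherwise.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TokenSettingsSketch {
    /**
     * Builds the token-related cassandra.yaml entries (hypothetical helper).
     * With a single token the initial_token is required; with vnodes it is
     * omitted so that Cassandra picks its own tokens.
     */
    static Map<String, Object> tokenSettings(int numTokens, String initialToken) {
        if (numTokens < 1) {
            throw new IllegalArgumentException("num_tokens must be at least 1");
        }
        Map<String, Object> yaml = new LinkedHashMap<>();
        yaml.put("num_tokens", numTokens);
        if (numTokens == 1) {
            if (initialToken == null || initialToken.isEmpty()) {
                throw new IllegalArgumentException("initial_token is required when num_tokens is 1");
            }
            yaml.put("initial_token", initialToken);
        }
        return yaml;
    }

    public static void main(String[] args) {
        System.out.println(tokenSettings(1, "42"));    // single-token node
        System.out.println(tokenSettings(256, null));  // vnode node, no initial_token
    }
}
```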

arunagrawal84 · 2017-11-13T20:57:21Z

@zmarois sorry, I have not yet looked into your pull request; I am on vacation. I have added reviewers to your pull request for the time being.
Also, to answer your question: yes, please submit a pull request for 3.x as well, because we are trying to keep 3.x and 3.11 at feature parity. The reason is that 3.x is probably the most widely used branch. master is at this point too far from 3.x; we plan to merge 3.x into master, but that has been delayed by new features/improvements.

zmarois · 2017-11-13T21:06:52Z

No worries @arunagrawal84. I've got a fork I'm building and integration testing with; I just found a problem today that I have since patched in, so I'm not 100% sure that this is done. I appreciate any feedback the reviewers you added have, and I'll comment when I have integration tested this better, including restoring from backups.

zmarois · 2017-11-22T20:05:32Z

I've now tested backups and restores using this change and have confirmed the data loads correctly. However, I also tried a restore without the complexity of setting the initial_token to be the same as when the backup was taken, per step 5 in this overview, and saw the data still load correctly. Perhaps my comparison of the data performed a read-repair of all the data, and that resolved data being initially loaded onto the wrong nodes. Further, I'm not quite sure how after setting initial_token on restore I'd then be able to add nodes that are vnodes, so I'd like to remove the complexity of setting initial_token when restoring from a backup at all, or at least make it configurable to do so.

I've stripped that complexity, and filed the PR against the 3.x branch.

One more thing to note: I didn't do my testing against this branch itself, but a fork with a few more changes/features to make this easier to install into my current infrastructure, namely:

* The ability to run outside of an Auto-scaling group. The ASG is used for a few purposes, for which I needed to provide alternative implementations:
  - To calculate balanced token ranges from the total cluster size when not using vnodes. As I am using vnodes, I chose to limit this change to only being supported for vnode instances, but it could be implemented with an instance name pattern.
  - To determine if a node registered in SDB's InstanceIdentity domain is still living or not, which I just implemented as "is this node in the list of all instances returned by ec2 client's describeInstances".
  - To wait for all nodes (thus all nodes need to be known, which the ASG provides) to be registered with SDB before returning seeds when running in a single-zone deployment. I have to admit, I don't understand the reasoning for this edge case. I could implement this again with an EC2 instance name pattern, but as I'm not testing with a single AZ, I just left this case as not implemented under the vnode feature. Further, a large benefit of vnodes is the simpler flexibility in changing the size of the cluster, so the notion of "are all up" is less applicable.
  - To determine the app id to query SDB with for configuration, before the configuration stored in SDB has specified the cluster name. This was already inferred from system property ASG_NAME if it was set, so I just used that.
* The ability to run nodes on a private network in a private VPC, related to the conversation here.
* A static seed definition. I do not dislike Priam's mechanism to provide seeds, but this was useful for adding nodes running Priam to an existing cluster that is not running Priam.
* The ability to run both SDB domains in the same region.
* The ability to configure credentials at runtime instead of compile time, just so I didn't need to build different artifacts to vary these.
* Auth on the jmx port.
* The ability to override, from system properties, some configuration that is in the properties config file, by inverting the precedence.
* A check when replacing a dead node that some node in the existing cluster actually thinks it's dead, as a node that failed to ever really join the cluster could be in SDB, but the replacement node can't actually take its place in the cluster. Really just helpful during testing when I was having nodes not come up due to configuration issues.
* Actually setting listen_address to something other than null. I have to admit, per the docs on this property, I do not understand how this works being hard-coded to null for anything but a single-node installation, unless listen_interface is set in the config externally.
* Letting backupStatusFileLocation be variable.
* Shutting down Quartz gracefully.

If you'd consider any of these features useful, I'd gladly file separate feature PRs for them.
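
The liveness check described in the list above ("is this node in the list of all instances returned by ec2 client's describeInstances") comes down to a set-membership test. Below is a minimal sketch under the assumption that both the SDB-registered instance ids and the ids returned by EC2 have already been fetched into plain collections; the class and method names are hypothetical and no AWS SDK calls are shown.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LivenessSketch {
    /**
     * Returns the registered instance ids that do not appear among the ids
     * EC2 reported, i.e. the candidates for "dead" nodes.
     */
    static List<String> deadInstances(Collection<String> registeredIds,
                                      Collection<String> describedIds) {
        Set<String> living = new HashSet<>(describedIds);
        List<String> dead = new ArrayList<>();
        for (String id : registeredIds) {
            if (!living.contains(id)) {
                dead.add(id);
            }
        }
        return dead;
    }
}
```

A caller would pass the ids read from the InstanceIdentity domain as `registeredIds` and treat any id in the result as a candidate for replacement.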

(Edited comment to change links to more concise commit links)

costimuraru · 2017-11-24T12:57:07Z

@zmarois nice work! I look forward to seeing this PR merged, so that we can benefit from VNodes.

zmarois · 2017-11-27T17:38:56Z

...-cass-extensions/src/main/java/com/netflix/priam/cassandra/extensions/PriamStartupAgent.java

        String replacedIp = "";
        String extraEnvParams = null;

        while (true)
        {
            try
            {
-                token = DataFetcher.fetchData("http://127.0.0.1:8080/Priam/REST/v1/cassconfig/get_token");
+                isExternallyDefinedToken = Boolean.parseBoolean(DataFetcher.fetchData("http://127.0.0.1:8080/Priam/REST/v1/cassconfig/is_externally_defined_token"));
+                if (isExternallyDefinedToken) {


I needed to allow the token to be unset, while still distinguishing between Priam being fully initialized but having no token and Priam not running at all.

I could instead have differentiated these two cases by returning a 500 when Priam hasn't fully initialized, and making this startup agent treat a 500 differently from a 204.
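
For comparison, the status-code alternative mentioned above (which this PR does not take) might look roughly like the following on the agent side. This is a sketch under assumptions: the mapping of 200, 204, and everything else to states is illustrative, and the enum and method names are hypothetical.

```java
class StartupProbeSketch {
    enum PriamState { NOT_READY, READY_WITHOUT_TOKEN, READY_WITH_TOKEN }

    /**
     * Interprets a response from a get_token-style endpoint.
     * Assumed convention: 200 with a body means a token is defined,
     * 204 means Priam is initialized but has no token (vnodes),
     * anything else (for example 500, or -1 for "connection failed")
     * means Priam is not ready and the agent should retry.
     */
    static PriamState classify(int httpStatus, String body) {
        if (httpStatus == 200 && body != null && !body.trim().isEmpty()) {
            return PriamState.READY_WITH_TOKEN;
        }
        if (httpStatus == 204) {
            return PriamState.READY_WITHOUT_TOKEN;
        }
        return PriamState.NOT_READY;
    }
}
```

The agent's retry loop would keep polling while the result is `NOT_READY` and only set a token on `READY_WITH_TOKEN`.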


zmarois · 2017-11-27T18:43:21Z

priam/src/main/java/com/netflix/priam/identity/InstanceIdentity.java

@@ -104,7 +107,7 @@ public InstanceIdentity(IPriamInstanceFactory factory, IMembership membership, I
        init();
    }

-    public PriamInstance getInstance() {
+    PriamInstance getInstance() {


Token was previously used both as the initial_token and as the identifier of a node's backup in the context of the entire cluster.

When using vnodes, I don't need an initial token, but I do still need an identifier for the backup, so I wanted to separate these two concerns. Because the backup identifier is redundant with data already in SDB, it felt better not to put it in the PriamInstance class, and instead to use PriamInstance purely as the data-access DTO and InstanceIdentity as the business model.
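
To make the separation concrete, here is a small sketch of a plain data holder plus a business-level wrapper that derives the backup identifier. It is not the code in this PR: the class names, the `slotId` field, and the choice of the slot id as the vnode backup identifier are all assumptions made for illustration.

```java
class BackupIdentifierSketch {
    /** Plain data holder, standing in for a DTO such as PriamInstance. */
    static final class InstanceRecord {
        final int slotId;    // hypothetical: a per-node id already stored with the instance
        final String token;  // initial token; may be null when vnodes are used

        InstanceRecord(int slotId, String token) {
            this.slotId = slotId;
            this.token = token;
        }
    }

    /** Business-level view, standing in for InstanceIdentity. */
    static final class Identity {
        private final InstanceRecord record;
        private final int numTokens;

        Identity(InstanceRecord record, int numTokens) {
            this.record = record;
            this.numTokens = numTokens;
        }

        /** Single-token nodes keep using the token; vnode nodes use the slot id (assumed). */
        String backupIdentifier() {
            if (numTokens == 1) {
                return record.token;
            }
            return String.valueOf(record.slotId);
        }
    }
}
```

Keeping the derivation in the wrapper means nothing new has to be persisted, which matches the point that the identifier is redundant with data already stored.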

zmarois · 2017-11-27T18:51:36Z

priam/src/main/java/com/netflix/priam/resources/RestoreServlet.java

-    /**
-     * Find closest token in the specified region
-     */
-    private String closestToken(String token, String region) {


I moved this into RestoreTokenSelector to keep the logic for determining the closest token in one place

zmarois · 2017-11-27T18:51:58Z

priam/src/main/java/com/netflix/priam/restore/RestoreTokenSelector.java

@@ -54,11 +62,52 @@ public RestoreTokenSelector(ITokenManager tokenManager, @Named("backup") IBackup
     *            Date for which the backups are available
     * @return Token as BigInteger
     */
-    public BigInteger getClosestToken(BigInteger tokenToSearch, Date startDate) {
+    public BigInteger getClosestToken(String tokenToSearch, Date startDate)


The changes to this method are purely to improve the error message for the case where finding the closest token is invalid because the snapshot was taken with multiple tokens.
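
A sketch of what such a guard could look like follows. It assumes that a multi-token value arrives as a comma-separated string and that "closest" means smallest absolute distance; the class name, the candidate-list parameter, and the message wording are hypothetical and not taken from RestoreTokenSelector.

```java
import java.math.BigInteger;
import java.util.List;

class ClosestTokenSketch {
    /**
     * Returns the candidate token nearest to tokenToSearch, rejecting
     * multi-token input with an explicit message instead of a parse error.
     */
    static BigInteger getClosestToken(String tokenToSearch, List<BigInteger> backedUpTokens) {
        if (tokenToSearch == null || tokenToSearch.trim().isEmpty()) {
            throw new IllegalArgumentException("No token was provided to search for");
        }
        if (tokenToSearch.contains(",")) {
            throw new IllegalArgumentException(
                "Cannot find the closest token because the snapshot was taken with multiple tokens: "
                    + tokenToSearch);
        }
        if (backedUpTokens.isEmpty()) {
            throw new IllegalArgumentException("No backed-up tokens to compare against");
        }
        BigInteger target = new BigInteger(tokenToSearch.trim());
        BigInteger best = null;
        for (BigInteger candidate : backedUpTokens) {
            if (best == null
                    || candidate.subtract(target).abs().compareTo(best.subtract(target).abs()) < 0) {
                best = candidate;
            }
        }
        return best;
    }
}
```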

zmarois · 2017-11-27T18:52:13Z

priam/src/test/java/com/netflix/priam/identity/DoubleRingTest.java

@@ -15,19 +15,13 @@
 *
 */

-package com.netflix.priam.backup.identity;
+package com.netflix.priam.identity;


I moved these tests to the same package as the classes they test, so they have access to the now-package-private instance properties.

arunagrawal84 · 2017-12-05T17:31:35Z

@zmarois wanted to give more context: This requires some changes in our infrastructure to accommodate vnodes which is why it is taking longer than expected.

zmarois · 2017-12-05T18:30:53Z

@arunagrawal84 no worries. I appreciate the check-in. I tried to write this in such a way that, if you still have num_tokens set to 1 (which is still the default), Priam should operate identically to how it did without this branch.

If you have concrete feedback on something I did here that makes Priam not backwards-compatible, let me know; I'd gladly fix it. If it's just something internal, perhaps your own fork that I do not realize isn't compatible with this, I understand.

darkpssngr · 2018-03-22T18:23:07Z

Any updates on this?

zmarois · 2018-03-22T18:55:22Z

FWIW, we have been running a build of our fork with this in our production environment for about two months. @darkpssngr feel free to try out a build of our 3.11 branch.

Note it does have a few more changes in it, namely:

* A static seed definition. I do not dislike Priam's mechanism to provide seeds, but this was useful for adding nodes running Priam to an existing cluster that is not running Priam.
* A check when replacing a dead node that some node in the existing cluster actually thinks it's dead, as a node that failed to ever really join the cluster could be in SDB, but the replacement node can't actually take its place in the cluster. Really just helpful during testing when I was having nodes not come up due to configuration issues.
* The ability to override, from system properties, some configuration that is in the properties config file, by inverting the precedence.
* The ability to configure credentials at runtime instead of compile time, just so I didn't need to build different artifacts to vary these.
* The ability to run outside of an Auto-scaling group. Only possible with vnodes because we don't need to calculate token ranges (based on total cluster size defined by ASG).
* Actually setting listen_address to something other than null. I have to admit, per the docs on this property, I do not understand how this works being hard-coded to null for anything but a single-node installation, unless listen_interface is set in the config externally.
* Not validating the MD5 in multipart uploads, since that validation doesn't work for encrypted S3 buckets. I still want to push this back into the Netflix repo when I get a chance.
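
For context on why the token-range calculation mentioned in the list above depends on knowing the total cluster size, evenly spaced single tokens can be computed roughly as follows. This is a generic sketch for the Murmur3Partitioner token range (-2^63 to 2^63 - 1), not Priam's actual formula; the class and method names are hypothetical.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

class BalancedTokensSketch {
    // Lowest Murmur3Partitioner token and the size of the full token range (2^64).
    static final BigInteger MIN_TOKEN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger RANGE = BigInteger.ONE.shiftLeft(64);

    /** Returns clusterSize tokens spaced evenly around the ring. */
    static List<BigInteger> balancedTokens(int clusterSize) {
        if (clusterSize <= 0) {
            throw new IllegalArgumentException("clusterSize must be positive");
        }
        List<BigInteger> tokens = new ArrayList<>();
        for (int i = 0; i < clusterSize; i++) {
            BigInteger offset = RANGE.multiply(BigInteger.valueOf(i))
                                     .divide(BigInteger.valueOf(clusterSize));
            tokens.add(MIN_TOKEN.add(offset));
        }
        return tokens;
    }
}
```

Every token depends on `clusterSize`, so changing the cluster size changes the ideal position of every node; with vnodes this calculation is not needed at all.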

Other than setting listen_address (which I don't understand), I don't think any of these would be breaking relative to the base Netflix repo if you don't explicitly opt into them, so you should be able to get back to base easily enough.

darkpssngr · 2018-03-23T17:21:57Z

@zmarois Cool, thanks. I'll try it out. Ours is a new cluster, so I don't think backward compatibility is an issue here 👍

zmarois · 2018-03-23T17:51:15Z

Fair. Or consider it forwards compatibility with the base Netflix branch if you switch back to it.

sbilello · 2018-10-22T22:41:40Z

@zmarois and @darkpssngr Do you currently use this: https://github.com/Cimpress-MCP/Priam ?

zmarois · 2018-10-23T13:23:28Z

@sbilello unfortunately, no. My team no longer uses that. We did use it for about 6 months and it did enable us to perform a restore of a lost node at a particularly troublesome time. However, we have since moved off of Cassandra; we just couldn't justify the operational cost of managing it compared to using a hosted solution (FWIW, we actually ended up going with a hosted Postgres database, which also trades off some performance against the costs of hosted Cassandra).

If you would like to pick it up, I'd gladly fill you in with whatever knowledge I have of it. My largest open concern with it was assigning the correct token ranges on restore, per step 5 in this overview. While on a full restore I had seen all the data appear to propagate, the cluster had to rebalance it while restoring (which just takes too long), so I think something needs to be done to pre-assign the token ranges such that nodes start up and take the correct token ranges for the restored data.

I believe @darkpssngr, per this comment, opted to use EBS snapshots. I dunno if that has worked well for them or if they have tested a partial or full restore with it.

trashguy · 2019-01-08T21:58:55Z

Any movement on this?

arunagrawal84 · 2019-01-08T22:11:44Z

vnodes are more complicated w.r.t. backups and restores. The token maps which an instance holds can change considerably between a backup and its subsequent incremental backup. Restoring a cluster with vnodes is another challenge.
We want to pursue vnodes only when we have a complete solution for backups and restores, which produces consistent results.

arunagrawal84 requested review from arunagrawal84, jolynch and vinaykumarchella November 13, 2017 20:54

zmarois force-pushed the 3.11vnodesupport branch 3 times, most recently from 9859f59 to 0712a9e November 27, 2017 16:08

Separating concern of token from backup identifier to enable more than one token. (8f7a70f)

zmarois force-pushed the 3.11vnodesupport branch from fc85bb4 to 8f7a70f November 27, 2017 18:42

arunagrawal84 mentioned this pull request Jan 2, 2018

Is it possible to use priam only for backup? #649

Closed
