Handling incoming messages while the database is unavailable #279
So I generally oppose adding options, so I'd like to pick one of either dropping or queueing messages. Queueing makes the most sense to me in general, but if you feel strongly then drop instead.

Maintenance messages need to be PER channel; they can't be global since Courier is handling channels across many different organizations. So that can be something that is part of the config JSON. A SQL file sounds mighty weird and annoying in Go; JSON marshalling is easy / natural, and dumping channel configs to a big JSON blob makes a lot more sense to me.

Most of this is already there, so I don't think implementation should be super difficult. None of the channel implementations should have to change, I don't think. Maybe it could be a separate backend entirely, though I'm not entirely sure that makes sense as opposed to a mode that the RapidPro backend runs in. Process to "run" this might be something like:
Yeah, JSON is probably better. I definitely agree on the messages being per-channel. And we can drop the 'reply' config and just not reply if the channel maintenance reply is null.
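For illustration, here is a minimal Go sketch of dumping channel configs to a big JSON blob on disk and reading them back when the DB is gone. The `channelConfig` type and file layout are assumptions for the example, not Courier's actual types:

```go
package maintconfig

import (
	"encoding/json"
	"os"
)

// channelConfig is a hypothetical stand-in for a channel's cached settings.
type channelConfig struct {
	UUID         string                 `json:"uuid"`
	ChannelType  string                 `json:"channel_type"`
	Config       map[string]interface{} `json:"config"`
	MaintMessage string                 `json:"maint_message,omitempty"`
}

// dumpConfigs writes all channel configs to a single JSON file while the DB
// is still reachable.
func dumpConfigs(path string, configs []channelConfig) error {
	data, err := json.MarshalIndent(configs, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0600)
}

// loadConfigs reads the configs back, e.g. when the DB is unreachable.
func loadConfigs(path string) ([]channelConfig, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var configs []channelConfig
	err = json.Unmarshal(data, &configs)
	return configs, err
}
```

Keeping everything in one blob means maintenance mode only needs a single artifact to boot with.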
When we have an outage, we have some people who just won't give up messaging the bot. If it goes on a while, they end up in a supremely weird state when that flood of impatience starts getting processed. That would be my case for dropping. It also removes time pressure to some extent (we aren't running against a clock as the Redis instance fills). If this could be used while Redis also goes away, it would be a mechanism for us to resize/scale Redis. Not a top priority, but I just can't think of how else to scale Redis now without a complete mess.

EDIT: That said, caching the incoming messages allows for some interesting resiliency gains if the DB goes out unexpectedly, I think (i.e. maintenance or other unexpected failover)? I don't want to spiral on this, but there are a lot of ways to think about this capability.

A small nuance would be for this to read from ENV first, then from disk, to better accommodate running in a containerized environment. Maybe it would be nice for Courier to write the JSON you'd need in the ENV at startup, to make that less error-prone.
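A small sketch of that ENV-first, then-disk lookup; the variable name `COURIER_MAINT_CONFIG` and the fallback path are made up for the example, not real Courier settings:

```go
package maintenv

import (
	"fmt"
	"os"
)

// readMaintConfig prefers a JSON blob from the environment (handy in
// containers) and falls back to a file Courier could write at startup.
// Call it with something like "COURIER_MAINT_CONFIG" (an invented name).
func readMaintConfig(envVar, fallbackPath string) ([]byte, error) {
	if blob := os.Getenv(envVar); blob != "" {
		return []byte(blob), nil
	}
	data, err := os.ReadFile(fallbackPath)
	if err != nil {
		return nil, fmt.Errorf("no maint config in %s or %s: %w", envVar, fallbackPath, err)
	}
	return data, nil
}
```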
You don't think they would chill out if they got an autoresponder? In theory maintenance mode wouldn't touch anything, including Redis. It spools to disk (that's what it currently does), then rereads those messages once it is out of maintenance. Not sure writing all the configs for all channels to the environment is really tenable; there are many thousands on various installs.
In theory why couldn't it put them on the Redis queue if it had the channel configs?
No reason, there is already support for on-disk spooling though.
OK, I think this approach is making sense. Have the behavior to queue to disk, then if we want to drop the messages we just empty the spool dir before restarting in normal mode.

The one thing we really want to do (and we're open to ideas) is send a single "we're back" message to everyone who messaged during the outage. The difficulty is making it unique users only (if I messaged 3 times, I don't need 3 "we're back" messages). The best I came up with so far is to write a separate binary that goes through the spool files, finds all the unique contacts, and uses the API to kick off a new flow/message for each of them.
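As a rough sketch of that separate binary, assuming purely for the example that each spool file is a JSON object with a `urn` field (the real spool format may differ), something like this could print the unique contacts to hand to the API:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: spooluniq <spool-dir>")
		os.Exit(1)
	}
	spoolDir := os.Args[1]
	seen := map[string]bool{}

	// Walk every spool file and keep only the first occurrence of each contact URN.
	_ = filepath.Walk(spoolDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return nil
		}
		var msg struct {
			URN string `json:"urn"` // assumed field name, not the real spool schema
		}
		if json.Unmarshal(data, &msg) == nil && msg.URN != "" && !seen[msg.URN] {
			seen[msg.URN] = true
			fmt.Println(msg.URN) // one line per unique contact
		}
		return nil
	})
}
```

Piping that list into a script that calls the flow-start API keeps the one-off behavior out of Courier itself.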
Ya, either that or a manual Python query to look for all messages created within a window, then using that as the target for the messages to send out. But ya, that starts getting into such specific behavior that I would very much not want it as an additional "option" within Courier / RapidPro.
So I think we're in agreement: do what you outlined here: #279 (comment). If so, I'll get started on it unless you think you can do it quickly.
Yep, sounds good. This is your baby! :)
Sorry, late to this. I will add, and Lisa will agree, that a lot of people don't read the autoresponder and will continue to text us over and over again expecting a response. This happens with Bot on Fire all the time, which is why we trap them in purgatory until the release flow fires.

Because of this behavior, we'd need to make it so we respond only to the initial inbound and then not respond further; at some point they will get the hint. We can then do a push once we're back up to the folks that texted during the outage.

This was a problem when doing this from Twilio: Twilio Functions had no way to store who wrote during an outage, so all we could do was respond again and again, which ran up our texting bill.
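A minimal sketch of the "reply only to the first inbound per contact" idea, using an in-memory set keyed by contact URN; a real version would persist this (it doubles as the list of people to push to once the system is back). Names here are hypothetical:

```go
package maintrespond

import "sync"

// autoresponder remembers which contacts already got the maintenance reply
// during this outage.
type autoresponder struct {
	mu   sync.Mutex
	seen map[string]bool // keyed by contact URN
}

func newAutoresponder() *autoresponder {
	return &autoresponder{seen: map[string]bool{}}
}

// shouldReply reports whether this is the contact's first inbound of the
// outage; later messages from the same URN are recorded but not replied to.
func (a *autoresponder) shouldReply(urn string) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.seen[urn] {
		return false
	}
	a.seen[urn] = true
	return true
}
```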
As mentioned in the Google group, we'd like to add the capability for Courier to either switch into a maintenance mode or be started in one.
Here's what I'm envisioning:
This mode would allow caching of the channel config from the database so Courier can run while the DB is down. In this mode, it would either queue or drop incoming messages, and auto-respond to them or not. Seems the best way is to add it to the channel config table: two boolean columns, `maint_queue_messages` and `maint_autorespond`, and one string column, `maint_message`.

When Courier was started in maint mode, it would read the channel config table in the DB, hold the entire config in memory, and also write a copy of it to a local sql file. If it cannot connect to the DB in maint mode, it looks for the local sql file and reads that into memory.
Once running, it would handle the messages as configured for each channel.
As far as implementation, I'm interested in your thoughts. The idea would be to keep as much of the Courier code intact and in use in this mode as possible, so it doesn't add extra overhead to maintain as Courier moves forward.
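To make the proposed columns concrete, a hedged Go sketch of how the three channel-level settings might drive behavior in maint mode; the struct and function names are hypothetical, not the actual Courier schema or API:

```go
package maintmode

// maintSettings mirrors the three proposed channel config columns.
type maintSettings struct {
	QueueMessages bool   // maint_queue_messages: spool inbound messages instead of dropping them
	Autorespond   bool   // maint_autorespond: send a maintenance reply
	Message       string // maint_message: the reply text (skip replying if empty)
}

// handleInbound decides what to do with one inbound message while in maint
// mode; spool and reply are placeholders for whatever the backend provides.
func handleInbound(s maintSettings, spool, reply func(body string) error, body string) error {
	if s.QueueMessages {
		if err := spool(body); err != nil {
			return err
		}
	}
	if s.Autorespond && s.Message != "" {
		return reply(s.Message)
	}
	return nil
}
```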