Handling incoming messages while the database is unavailable #279
So I generally oppose adding options, so I'd like to pick one of either dropping or queueing messages. Queueing makes the most sense to me in general, but if you feel strongly then drop instead.

Maintenance messages need to be PER channel; they can't be global since Courier is handling channels across many different organizations. So that can be something that is part of the config JSON. A SQL file sounds mighty weird and annoying in Go; JSON marshalling is easy / natural, and dumping channel configs to a big JSON blob makes a lot more sense to me.

Most of this is already there, so I don't think implementation should be super difficult. None of the channel implementations should have to change, I don't think. Maybe it could be a separate backend entirely, though I'm not entirely sure that makes sense as opposed to a mode that the RapidPro backend runs in. Process to "run" this might be something like:
Yeah, JSON is probably better. I definitely agree on the messages being per-channel. And we can drop the 'reply' config and just not reply if the channel maintenance reply is null.
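For illustration, here is a minimal Go sketch of dumping channel configs to a big JSON blob on disk and reading them back when the DB is gone. The `channelConfig` type and file layout are assumptions for the example, not Courier's actual types:

```go
package maintconfig

import (
	"encoding/json"
	"os"
)

// channelConfig is a hypothetical stand-in for a channel's cached settings.
type channelConfig struct {
	UUID         string                 `json:"uuid"`
	ChannelType  string                 `json:"channel_type"`
	Config       map[string]interface{} `json:"config"`
	MaintMessage string                 `json:"maint_message,omitempty"`
}

// dumpConfigs writes all channel configs to a single JSON file while the DB
// is still reachable.
func dumpConfigs(path string, configs []channelConfig) error {
	data, err := json.MarshalIndent(configs, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0600)
}

// loadConfigs reads the configs back, e.g. when the DB is unreachable.
func loadConfigs(path string) ([]channelConfig, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var configs []channelConfig
	err = json.Unmarshal(data, &configs)
	return configs, err
}
```

Keeping everything in one blob means maintenance mode only needs a single artifact to boot with.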
When we have an outage, we have some people who just won't give up messaging the bot. If it goes on a while, they end up in a supremely weird state when that flood of impatience starts getting processed. That would be my case for dropping. It also removes time pressure to some extent (we aren't running against a clock as the Redis instance fills). If this could be used while Redis also goes away, it would be a mechanism for us to resize/scale Redis. Not a top priority, but I just can't think of how else to scale Redis now without a complete mess.

EDIT: That said, caching the incoming messages allows for some interesting resiliency gains if the DB goes out unexpectedly, I think (i.e. maintenance or other unexpected failover)? I don't want to spiral on this, but there are a lot of ways to think about this capability.

A small nuance would be for this to read from ENV first, then from disk, to better accommodate running in a containerized environment. Maybe it would be nice for Courier to write the JSON you'd need in the ENV at startup, to make that less error-prone.
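A small sketch of that ENV-first, then-disk lookup; the variable name `COURIER_MAINT_CONFIG` and the fallback path are made up for the example, not real Courier settings:

```go
package maintenv

import (
	"fmt"
	"os"
)

// readMaintConfig prefers a JSON blob from the environment (handy in
// containers) and falls back to a file Courier could write at startup.
// Call it with something like "COURIER_MAINT_CONFIG" (an invented name).
func readMaintConfig(envVar, fallbackPath string) ([]byte, error) {
	if blob := os.Getenv(envVar); blob != "" {
		return []byte(blob), nil
	}
	data, err := os.ReadFile(fallbackPath)
	if err != nil {
		return nil, fmt.Errorf("no maint config in %s or %s: %w", envVar, fallbackPath, err)
	}
	return data, nil
}
```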
You don't think they would chill out if they got an autoresponder? In theory maintenance mode wouldn't touch anything, including Redis. It spools to disk (that's what it currently does), then rereads those messages once it is out of maintenance. Not sure writing all the configs for all channels to the environment is really tenable; there are many thousands on various installs.
In theory why couldn't it put them on the Redis queue if it had the channel configs?
No reason, there is already support for on-disk spooling though.
OK, I think this approach is making sense. Have the behavior to queue to disk, then if we want to drop the messages we just empty the spool dir before restarting in normal mode.

The one thing we really want to do (and we're open to ideas) is send a single "we're back" message to everyone who messaged during the outage. The difficulty is making it unique users only (if I messaged 3 times, I don't need 3 "we're back" messages). The best I came up with so far is to write a separate binary that goes through the spool files, finds all the unique contacts, and uses the API to kick off a new flow/message for each of them.
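As a rough sketch of that separate binary, assuming purely for the example that each spool file is a JSON object with a `urn` field (the real spool format may differ), something like this could print the unique contacts to hand to the API:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: spooluniq <spool-dir>")
		os.Exit(1)
	}
	spoolDir := os.Args[1]
	seen := map[string]bool{}

	// Walk every spool file and keep only the first occurrence of each contact URN.
	_ = filepath.Walk(spoolDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return nil
		}
		var msg struct {
			URN string `json:"urn"` // assumed field name, not the real spool schema
		}
		if json.Unmarshal(data, &msg) == nil && msg.URN != "" && !seen[msg.URN] {
			seen[msg.URN] = true
			fmt.Println(msg.URN) // one line per unique contact
		}
		return nil
	})
}
```

Piping that list into a script that calls the flow-start API keeps the one-off behavior out of Courier itself.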
Ya, either that or a manual Python query to look for all messages created within a window, then using that as the target for the messages to send out. But ya, that starts getting into such specific behavior that I would very much not want it as an additional "option" within Courier / RapidPro.
So I think we're in agreement: do what you outlined here: #279 (comment). If so, I'll get started on it unless you think you can do it quickly.
Yep, sounds good. This is your baby! :)
Sorry, late to this. I will add, and Lisa will agree, that a lot of people don't read the autoresponder and will continue to text us over and over again expecting a response. This happens with Bot on Fire all the time, which is why we trap them in purgatory until the release flow fires.

Because of this behavior, we'd need to make it so we respond only to the initial inbound and then not respond further; at some point they will get the hint. We can then do a push once we're back up to the folks that texted during the outage.

This was a problem when doing this from Twilio: Twilio Functions had no way to store who wrote during an outage, so all we could do was respond again and again, which ran up our texting bill.
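A minimal sketch of the "reply only to the first inbound per contact" idea, using an in-memory set keyed by contact URN; a real version would persist this (it doubles as the list of people to push to once the system is back). Names here are hypothetical:

```go
package maintrespond

import "sync"

// autoresponder remembers which contacts already got the maintenance reply
// during this outage.
type autoresponder struct {
	mu   sync.Mutex
	seen map[string]bool // keyed by contact URN
}

func newAutoresponder() *autoresponder {
	return &autoresponder{seen: map[string]bool{}}
}

// shouldReply reports whether this is the contact's first inbound of the
// outage; later messages from the same URN are recorded but not replied to.
func (a *autoresponder) shouldReply(urn string) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.seen[urn] {
		return false
	}
	a.seen[urn] = true
	return true
}
```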
As mentioned in the Google group, we'd like to add the capability for Courier to either switch into a maintenance mode or be started in one.
Here's what I'm envisioning:
This mode would allow caching of the channel config from the database so Courier can run while the DB is down. In this mode, it would either queue or drop incoming messages, and auto-respond to them or not. Seems the best way is to add it to the channel config table: two boolean columns, `maint_queue_messages` and `maint_autorespond`, and one string column, `maint_message`.

When Courier was started in maint mode, it would read the channel config table in the DB, hold the entire config in memory, and also write a copy of it to a local sql file. If it cannot connect to the DB in maint mode, it looks for the local sql file and reads that into memory.
Once running, it would handle the messages as configured for each channel.
As far as implementation, I'm interested in your thoughts. The idea would be to keep as much of the Courier code intact and in use in this mode as possible, so it doesn't add extra overhead to maintain as Courier moves forward.
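To make the proposed columns concrete, a hedged Go sketch of how the three channel-level settings might drive behavior in maint mode; the struct and function names are hypothetical, not the actual Courier schema or API:

```go
package maintmode

// maintSettings mirrors the three proposed channel config columns.
type maintSettings struct {
	QueueMessages bool   // maint_queue_messages: spool inbound messages instead of dropping them
	Autorespond   bool   // maint_autorespond: send a maintenance reply
	Message       string // maint_message: the reply text (skip replying if empty)
}

// handleInbound decides what to do with one inbound message while in maint
// mode; spool and reply are placeholders for whatever the backend provides.
func handleInbound(s maintSettings, spool, reply func(body string) error, body string) error {
	if s.QueueMessages {
		if err := spool(body); err != nil {
			return err
		}
	}
	if s.Autorespond && s.Message != "" {
		return reply(s.Message)
	}
	return nil
}
```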