feat: With this PR we add the possibility to have multiple connection pools in Orca #4619

Open · wants to merge 9 commits into master

Conversation

ovidiupopa07 (Contributor)

With this PR we add the possibility to have multiple connection pools in Orca: one for writes and one for reads.

Orca’s operations are primarily READ operations from the database (almost 85% of the SQL transactions are SELECT statements). For high-scale customers with hundreds of applications/pipelines and big execution contexts, this translates into extreme pressure on the backend database and high network utilization, with large volumes of data transferred from the database to the Orca pods.

We noticed that for high-scale customers we have had to double the database instance type twice, and we have now reached a situation where the writer endpoint of the database is being throttled on network bandwidth (hitting the 20 Gbps maximum).

On the flip side, the reader instance of the Orca database sits idle, since Orca doesn't support read-only operations through its SQL connection pools. Having Orca split traffic between the read and write database endpoints will dramatically improve performance and utilization, and also provide cost savings for high-scale customers.

We have tested this in our environment; below you can find the statistics
[Attached screenshots: before/after database statistics]
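
To make the split concrete, here is a minimal, generic HikariCP + jOOQ sketch (not the PR's actual implementation); the `ReadWriteRouter` name, pool sizes, and dialect are illustrative assumptions.

```kotlin
// Minimal, generic sketch of a read/write pool split (not the PR's actual code).
// The ReadWriteRouter name, pool sizes, and dialect are illustrative assumptions.
import com.zaxxer.hikari.HikariConfig
import com.zaxxer.hikari.HikariDataSource
import org.jooq.DSLContext
import org.jooq.SQLDialect
import org.jooq.impl.DSL

class ReadWriteRouter(writeUrl: String, readUrl: String) {
  private fun pool(url: String, readOnly: Boolean) = HikariDataSource(
    HikariConfig().apply {
      jdbcUrl = url
      isReadOnly = readOnly
      maximumPoolSize = if (readOnly) 50 else 20 // readers can be sized independently of writers
    }
  )

  // "default" pool: all writes, plus any read that must immediately see its own writes
  val write: DSLContext = DSL.using(pool(writeUrl, readOnly = false), SQLDialect.MYSQL)

  // reader pool: bulk SELECTs that can tolerate replication lag
  val read: DSLContext = DSL.using(pool(readUrl, readOnly = true), SQLDialect.MYSQL)
}
```

The open question, raised in the reviews below, is which reads can safely go to the read pool.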

@spinnakerbot (Contributor)

The following commits need their title changed:

Please format your commit title into the form:

`<type>(<scope>): <subject>`, e.g. `fix(kubernetes): address NPE in status check`

This allows us to easily generate changelogs & determine semantic version numbers when cutting releases. You can read more about commit conventions here.

@kkotula (Contributor) left a comment


LGTM 🚀 Thank you!

@dogonthehorizon (Contributor) left a comment


Let's make sure to document this change alongside the SQL configs for other services. Otherwise LGTM, and it would be good to get @dbyron0's take as well.

@dbyron-sf (Contributor) left a comment


I appreciate the effort here. Sounds like we've been running into similar struggles with the scalability of orca's database. We have a somewhat different way of solving this... also with multiple connection pools, but configured in a way that works with the current mechanism (i.e. the "write" connection pool is still named default).

Besides that difference though, we found that using the read replica for some read operations doesn't work... or at least it doesn't work without some other significant changes to teach orca to be aware of replication lag. When one task writes something to the database and another task runs immediately afterwards and reads, the current code expects to read back exactly what was written, but with replication lag that doesn't always happen.

We're still working through the steps to get the changes that handle that rolled out in prod and to gain confidence in them.

We have been using a read replica for a subset of operations though. Lemme see if I can get a PR for that.
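
To illustrate the failure mode dbyron-sf describes, here is a hedged jOOQ sketch; the table, columns, function names, and the fall-back-to-writer mitigation are illustrative assumptions, not Orca's actual code.

```kotlin
// Illustrative only: why a read routed to the replica right after a write can
// miss the row, and one possible mitigation (fall back to the writer pool).
// Table, columns, and function names are assumptions, not Orca's actual API.
import org.jooq.DSLContext
import org.jooq.impl.DSL.field
import org.jooq.impl.DSL.table

fun saveStageContext(write: DSLContext, stageId: String, body: String) {
  write.insertInto(table("stage_context"))
    .columns(field("stage_id"), field("body"))
    .values(stageId, body)
    .execute()
}

fun loadStageContext(read: DSLContext, write: DSLContext, stageId: String): String {
  // The replica may not have replayed the insert yet, so a read issued
  // "immediately afterwards" can come back empty even though the write succeeded.
  val fromReplica = read.select(field("body", String::class.java))
    .from(table("stage_context"))
    .where(field("stage_id").eq(stageId))
    .fetchOne()
  // Mitigation: treat a miss as "possibly stale" and re-read from the writer.
  return fromReplica?.value1()
    ?: write.select(field("body", String::class.java))
        .from(table("stage_context"))
        .where(field("stage_id").eq(stageId))
        .fetchOne()
        ?.value1()
    ?: error("stage $stageId not found on the writer either")
}
```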

@jasonmcintosh (Member)

Curious about the places you've found that don't handle async reads after writes. I know the correlation ids are one we hit. Our solution was to modify those to "ignore it if there's a unique constraint violation when it tries to insert after the read". I haven't traced all the places that do a similar read-after-write operation that would be impacted, but I THOUGHT most of those stages either did "ignore constraint failures" OR "retry the read on the next queue operation". That said, it would NOT surprise me if there are more such places.
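
A hedged sketch of the "ignore the unique constraint on insert" workaround described above, using jOOQ's onDuplicateKeyIgnore; the table and column names are illustrative assumptions, not Orca's actual schema.

```kotlin
// Illustrative only: tolerate the unique-constraint violation instead of failing
// when a stale replica read missed an existing row. Table and column names are
// assumptions, not Orca's actual schema.
import org.jooq.DSLContext
import org.jooq.impl.DSL.field
import org.jooq.impl.DSL.table

fun recordCorrelationId(ctx: DSLContext, correlationId: String, executionId: String) {
  ctx.insertInto(table("correlation_ids"))
    .columns(field("id"), field("execution_id"))
    .values(correlationId, executionId)
    .onDuplicateKeyIgnore() // duplicate key => another worker already inserted it; not an error
    .execute()
}
```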
