-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: adds scaling documentation #214
Conversation
with placeholder document for configuring the event bus.
Thanks for the pull request, @pomegranited! What's next?Please work through the following steps to get your changes ready for engineering review: 🔘 Get product approvalIf you haven't already, check this list to see if your contribution needs to go through the product review process.
🔘 Provide contextTo help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:
🔘 Get a green buildIf one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green. 🔘 Update the status of your PRYour PR is currently marked as a draft. After completing the steps above, update its status by clicking "Ready for Review", or removing "WIP" from the title, as appropriate. 🔘 Let us know that your PR is ready for review:Who will review my changes?This repository is currently maintained by Where can I find more information?If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:
When can I expect my changes to be merged?Our goal is to get community contributions seen and reviewed as efficiently as possible. However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:
💡 As a result it may take up to several weeks or months to complete a review and merge your PR. |
|
||
.. code-block:: yaml | ||
|
||
TBD -- implement configurable queue for Aspects tasks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From comment:
It would be better to move the aspects-related tasks to the high queue, so it will perform better and doesn't block other LMS tasks.
@Ian2012 @bmtcril IIRR, to do this we need to do this for both the platform-plugin-aspects event sinks and event-routing-backends tasks tasks, correct?
- add an app setting to configure the default celery queue. Default to
HIGH_PRIORITY_QUEUE
. - pass this queue setting as to the
queue
parameter wherever the tasks are run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And then create a new consumer for both services (lms, cms) named aspects-{{service}}-consumer
which are configured as any other lms-cms worker but that will only read that queue. However, I think the performance gains of batching are enough so that we don't need to this anymore. cc @bmtcril
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oo cool.. will await the results of your experiment, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't want to modify the default behaviorin ERB and I think we can enable it with the tutor plugin
|
||
.. code-block:: yaml | ||
|
||
TBD -- implement RALPH autoscaling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cf #81
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Ian2012 @bmtcril Should the scaling parameters for Ralph, Vector, and Superset be added to tutor-contrib-aspects, or should we keep them in https://github.com/eduNEXT/tutor-contrib-pod-autoscaling ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The values will be added via a filter in tutor-contrib-aspects
. This will be once the next named release is public and this PR is released: eduNEXT/tutor-contrib-pod-autoscaling#7
Small deployments can start with the following set up, and scale later: | ||
|
||
* 1 Clickhouse Keeper node, see `04-replication-zookeeper-01-minimal.yaml`_ | ||
* 1 Clickhouse node, see `03-persistent-volume-01-default-volume.yaml`_ | ||
|
||
For large deployments, we recommend: | ||
|
||
* 3 Clickhouse Keeper nodes to form the quorum: see `02-extended-3-nodes.yaml`_ | ||
* N Clickhouse nodes to perform the replication. | ||
|
||
If your k8s provider supports resizable volumes, see `03-persistent-volume-05-resizeable-volume-2.yaml`_ | ||
Otherwise, see `03-persistent-volume-02-pod-template.yaml`_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to be added to harmony, cf #80
|
||
.. code-block:: yaml | ||
|
||
TBD -- needs fixing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
.. code-block:: yaml | ||
|
||
TBD -- implement Superset autoscaling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cf #79
|
||
By default, the `Aspects Tutor plugin`_ deploys single nodes for these services: | ||
|
||
* :ref:`quick-start-ralph` or :ref:`quick-start-vector` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, running with Vector and not Ralph isn't working anymore:
INFO [alembic.runtime.migration] Running upgrade 0010 -> 0011
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/clickhouse_sqlalchemy/drivers/native/connector.py", line 152, in execute
response = execute(
^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 382, in execute
rv = self.process_ordinary_query(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 580, in process_ordinary_query
return self.receive_result(with_column_types=with_column_types,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 213, in receive_result
return result.get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/result.py", line 50, in get_result
for packet in self.packet_generator:
File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 229, in packet_generator
packet = self.receive_packet()
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 246, in receive_packet
raise packet.exception
clickhouse_driver.errors.ServerException: Code: 122.
DB::Exception: Tables have different structure.
I tried deploying with and without setting ASPECTS_XAPI_DATABASE: "openedx"
, and still got this error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you try with alembic downgrade base
and then alembic upgrade head
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that @Ian2012 's suggestion will destroy all of your xAPI data, you'll either need to recreate new logs or replay the tracking log. I was definitely able to deploy with Vector last week without rebuilding everything, I just changed the vector database to xapi instead of the other way around. I haven't done a fresh build just for Vector recently, though. I'll put that on the list to try soon.
Scaling Clickhouse | ||
================== | ||
|
||
We recommend using a Clickhouse service provider to manage your production cluster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Ian2012 @bmtcril I've been playing with deploying aspects using our Grove deployment system, and it was trivial to get it running with the default single-node Clickhouse deployment and persistent volume: https://superset.jill-aspects.staging.do.opencraft.hosting
I haven't tried using Harmony yet.
But if we're recommending using a hosted CH in production, do we need to add support for a scalable CH deployment to Harmony for Aspects v1? I'm happy to try, but I'm worried about maintaining something if no one is using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my opinion: I would add support for the ClickHouse operator and a multi-node ClickHouse installation with the needed configuration for Aspects to run. But let people use the operator as they need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, a lot of folks prefer to run their own for cost or data privacy reasons so we should support scaling for it. I think we should be pretty up front about complicated scaling configurations being at your own risk since there's no way we can support them all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, that can be documented as part of #80.
|
||
And all production deployments will need a `persistent Clickhouse cluster`_. | ||
|
||
Preparing the LMS workers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: add info about configuring ERB batching
This is a great start, thanks so much for doing it. I spent last week load testing these pipelines and have some new numbers and opinions about when each is a good solution as well as what knobs to turn for improving performance in each one. I'm going to be presenting on that at data wg tomorrow, but will try to have the presentation ready to share tonight. |
@pomegranited just checking in on this to see if it's still in progress? |
@mphilbrick211 Yep sorry.. I'll wrap this up next sprint. |
Hi @mphilbrick211 , I wasn't able to get to this task this sprint, so closing this PR for now. I'll re-open when ready. |
@pomegranited Even though your pull request wasn’t merged, please take a moment to answer a two question survey so we can improve your experience in the future. |
This change adds basic documentation for scaling resources related to Aspects.
However, as several scaling-related features still need to be fixed or implemented, it's a work in progress.
Closes #78