feat: adds scaling documentation #214
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
.. _event_bus: | ||
|
||
Running the event bus | ||
********************* | ||
|
||
WIP -- https://github.com/openedx/tutor-contrib-aspects/pull/610 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,143 @@ | ||
.. _scaling: | ||
|
||
Scaling your deployment | ||
*********************** | ||
|
||
By default, the `Aspects Tutor plugin`_ deploys single nodes for these services: | ||
|
||
* :ref:`quick-start-ralph` or :ref:`quick-start-vector` | ||
* Clickhouse: ephemeral single node cluster | ||
* Superset: shares Open edX's MySQL and Redis instances | ||
|
||
Most deployments benefit from scaling horizontally, vertically, or both, especially when running Aspects. Aspects is fed by Open edX event data, which is triggered by user actions on the site, and Open edX can generate a lot of it. The initial processing of this event data occurs in platform plugins, so scaling your LMS workers speeds up this processing. The faster events are processed, the sooner they appear in Aspects' dashboards and charts. | ||
|
||
All production deployments also need a `persistent Clickhouse cluster`_. | ||
|
||
Preparing the LMS workers | ||
TODO: add info about configuring ERB batching |
||
========================= | ||
|
||
Before deploying Aspects, we recommend :ref:`event_bus`. | ||
|
||
Or, if you'd rather use Celery, we recommend boosting the number of LMS workers to prepare the Celery queue for the high volume of Aspects tasks. | ||
|
||
These plugins can be used to configure and enable autoscaling for the LMS and CMS. See their READMEs for details: | ||
|
||
* `tutor-contrib-pod-autoscaling`_ : for single instance deployments | ||
* `tutor-contrib-grove`_ : for multiple instance deployments | ||
|
||
We also recommend configuring the LMS to use the high-priority celery queue for Aspects tasks (`platform-plugin-aspects event sinks`_, `event-routing-backends xAPI tasks`_). This leaves the low-priority queue clear for other LMS tasks. | ||
|
||
.. code-block:: yaml | ||
|
||
TBD -- implement configurable queue for Aspects tasks | ||
From comment:
@Ian2012 @bmtcril IIRR, to do this we need to do this for both the platform-plugin-aspects event sinks and the event-routing-backends tasks, correct?
And then create a new consumer for both services (lms, cms) named
Oo cool.. will await the results of your experiment, thank you!
I didn't want to modify the default behavior in ERB and I think we can enable it with the tutor plugin |
||
|
||
Scaling Ralph | ||
============= | ||
|
||
Without scaling, Ralph can be a bottleneck in the Aspects data pipeline. If you do not need an LRS at all, you can disable Ralph and use :ref:`quick-start-vector` instead. | ||
|
||
But if you will ever need an LRS, it's better to start with Ralph enabled and autoscaled. | ||
|
||
Ralph runs CPU-intensive operations, so we recommend scaling mainly on CPU: | ||
|
||
.. code-block:: yaml | ||
|
||
TBD -- implement RALPH autoscaling | ||
cf #81
@Ian2012 @bmtcril Should the scaling parameters for Ralph, Vector, and Superset be added to tutor-contrib-aspects, or should we keep them in https://github.com/eduNEXT/tutor-contrib-pod-autoscaling ?
The values will be added via a filter in |
||
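Until the Aspects settings for this are implemented, CPU-based autoscaling can be sketched as a plain Kubernetes ``HorizontalPodAutoscaler``. This is an illustration only, not the configuration Aspects will ship. The Deployment name ``ralph``, the replica bounds, and the CPU threshold are all assumptions. Utilization-based scaling also requires CPU resource requests on the target pods and a metrics source such as metrics-server.

.. code-block:: yaml

    # Hypothetical sketch: names and thresholds are assumptions, not Aspects defaults.
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: ralph                # assumed name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: ralph              # assumed Deployment name
      minReplicas: 1
      maxReplicas: 4             # assumed upper bound
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80   # scale out above 80% of requested CPU
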
|
||
.. _scaling-clickhouse: | ||
|
||
Scaling Clickhouse | ||
================== | ||
|
||
We recommend using a Clickhouse service provider to manage your production cluster. | ||
@Ian2012 @bmtcril I've been playing with deploying aspects using our Grove deployment system, and it was trivial to get it running with the default single-node Clickhouse deployment and persistent volume: https://superset.jill-aspects.staging.do.opencraft.hosting I haven't tried using Harmony yet. But if we're recommending using a hosted CH in production, do we need to add support for a scalable CH deployment to Harmony for Aspects v1? I'm happy to try, but I'm worried about maintaining something if no one is using it.
This is my opinion: I would add support for the ClickHouse operator and a multi-node ClickHouse installation with the needed configuration for Aspects to run. But let people use the operator as they need.
Yeah, a lot of folks prefer to run their own for cost or data privacy reasons so we should support scaling for it. I think we should be pretty up front about complicated scaling configurations being at your own risk since there's no way we can support them all.
Cool, that can be documented as part of #80. |
||
|
||
Aspects `avoids using experimental Clickhouse features`_, and so is suitable for use with cloud providers. Cloud hosting also provides support, automated backups, and autoscaling. See :ref:`clickhouse-cluster` for details. | ||
|
||
However, if you decide to run your own Clickhouse instance, you will need to take into account: | ||
|
||
* horizontal and vertical scaling | ||
* replication and quorum | ||
* data storage requirements over time | ||
|
||
References: | ||
|
||
* `Clickhouse Operator`_: Helm charts, docs and examples from Altinity | ||
* `Clickhouse Keeper`_: recommended replication setup | ||
|
||
Small deployments can start with the following setup, and scale later: | ||
|
||
* 1 Clickhouse Keeper node, see `04-replication-zookeeper-01-minimal.yaml`_ | ||
* 1 Clickhouse node, see `03-persistent-volume-01-default-volume.yaml`_ | ||
|
||
For large deployments, we recommend: | ||
|
||
* 3 Clickhouse Keeper nodes to form the quorum: see `02-extended-3-nodes.yaml`_ | ||
* N Clickhouse nodes to perform the replication. | ||
|
||
If your k8s provider supports resizable volumes, see `03-persistent-volume-05-resizeable-volume-2.yaml`_. | ||
Otherwise, see `03-persistent-volume-02-pod-template.yaml`_.
Comment on lines +67 to +78: Needs to be added to harmony, cf #80 |
||
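As a rough illustration of the larger layout, a replicated cluster managed by the Altinity operator might be declared as follows. This is a sketch loosely modelled on the operator examples referenced in this section, not a tested Aspects configuration. The resource name, cluster name, Keeper hostname and port, and replica count are all assumptions. Storage templates are omitted here, so combine this with one of the persistent volume examples before using it.

.. code-block:: yaml

    # Hypothetical sketch: every name, host, and count here is an assumption.
    apiVersion: "clickhouse.altinity.com/v1"
    kind: "ClickHouseInstallation"
    metadata:
      name: "aspects-clickhouse"        # assumed resource name
    spec:
      configuration:
        zookeeper:
          nodes:
            # Assumed address of the Keeper service that fronts the 3-node quorum.
            - host: clickhouse-keeper.example.svc
              port: 2181
        clusters:
          - name: "replicated"          # assumed cluster name
            layout:
              shardsCount: 1
              replicasCount: 3          # the "N Clickhouse nodes" above; 3 is illustrative
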
|
||
Scaling Superset | ||
================ | ||
|
||
By default, Aspects configures Superset to share these resources with Open edX: | ||
|
||
* mysql | ||
* redis | ||
|
||
However, if it becomes too resource intensive, these services can be replaced with separate standalone services. | ||
|
||
.. code-block:: yaml | ||
|
||
TBD -- needs fixing | ||
|
||
|
||
|
||
.. note:: | ||
|
||
Ensure ``sql_require_primary_key = off`` for your MySQL server. | ||
|
||
Some of the Superset tables are initially created without primary keys, so if this flag is set, these migrations fail. | ||
|
||
Some hosting providers `like DigitalOcean`_ enable this flag by default. | ||
|
||
|
||
Superset should also be configured to autoscale based on CPU and RAM. Use a configuration similar to the one for your CMS: | ||
|
||
.. code-block:: yaml | ||
|
||
TBD -- implement Superset autoscaling | ||
cf #79 |
||
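Until the Aspects settings for this are implemented, autoscaling on both CPU and memory can be sketched as a plain Kubernetes ``HorizontalPodAutoscaler``. The Deployment name ``superset``, the replica bounds, and both thresholds are assumptions for illustration, not values provided by Aspects. The target pods need CPU and memory resource requests for utilization targets to be meaningful.

.. code-block:: yaml

    # Hypothetical sketch: names and thresholds are assumptions, not Aspects defaults.
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: superset             # assumed name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: superset           # assumed Deployment name
      minReplicas: 1
      maxReplicas: 4             # assumed upper bound
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 80

When several metrics are listed, the autoscaler computes a desired replica count for each and uses the largest.
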
|
||
|
||
Superset also supports these scaling features, which may be supported by future versions of Aspects. | ||
|
||
* `asynchronous queries`_: configure the database assets to enable "Asynchronous query execution mode", which moves query execution to the celery workers. | ||
This is useful for queries that run beyond a typical web request's timeout (30-60 seconds). | ||
* cache warming: schedule tasks to use the `Superset API`_ to pre-fetch data into the caches. | ||
This is useful for frequently-accessed datasets or charts. | ||
|
||
References: | ||
|
||
* https://www.restack.io/docs/superset-on-kubernetes | ||
* https://medium.com/airbnb-engineering/supercharging-apache-superset-b1a2393278bd | ||
* https://preset.io/blog/2020-08-11-nielsen-superset/ | ||
* https://flask.palletsprojects.com/en/1.1.x/becomingbig/ | ||
|
||
|
||
.. _Aspects Tutor plugin: https://github.com/openedx/tutor-contrib-aspects | ||
.. _tutor-contrib-pod-autoscaling: https://github.com/eduNEXT/tutor-contrib-pod-autoscaling | ||
.. _tutor-contrib-grove: https://gitlab.com/opencraft/dev/tutor-contrib-grove | ||
.. _platform-plugin-aspects event sinks: https://github.com/openedx/platform-plugin-aspects/blob/main/platform_plugin_aspects/tasks.py | ||
.. _event-routing-backends xAPI tasks: https://github.com/openedx/event-routing-backends/blob/master/event_routing_backends/tasks.py | ||
.. _persistent Clickhouse cluster: #scaling-clickhouse | ||
.. _Clickhouse cloud: https://clickhouse.com/cloud | ||
.. _avoids using experimental Clickhouse features: ../decisions/0013_clickhouse_experimental.html | ||
.. _Clickhouse Operator: https://github.com/Altinity/clickhouse-operator | ||
.. _Clickhouse Keeper: https://github.com/Altinity/clickhouse-operator/blob/master/docs/zookeeper_setup.md | ||
.. _04-replication-zookeeper-01-minimal.yaml: https://github.com/Altinity/clickhouse-operator/blob/master/docs/chi-examples/04-replication-zookeeper-01-minimal.yaml | ||
.. _03-persistent-volume-01-default-volume.yaml: https://github.com/Altinity/clickhouse-operator/blob/master/docs/chi-examples/03-persistent-volume-01-default-volume.yaml | ||
.. _02-extended-3-nodes.yaml: https://github.com/Altinity/clickhouse-operator/blob/master/docs/chk-examples/02-extended-3-nodes.yaml | ||
.. _03-persistent-volume-05-resizeable-volume-2.yaml: https://github.com/Altinity/clickhouse-operator/blob/master/docs/chi-examples/03-persistent-volume-05-resizeable-volume-2.yaml | ||
.. _03-persistent-volume-02-pod-template.yaml: https://github.com/Altinity/clickhouse-operator/blob/master/docs/chi-examples/03-persistent-volume-02-pod-template.yaml | ||
.. _like DigitalOcean: https://www.digitalocean.com/community/questions/how-to-disable-sql_require_primary_key-in-digital-ocean-manged-database-for-mysql | ||
.. _asynchronous queries: https://superset.apache.org/docs/installation/async-queries-celery/ | ||
.. _Superset API: https://superset.apache.org/docs/api/ |
Unfortunately, running with Vector and not Ralph isn't working anymore: I tried deploying with and without setting `ASPECTS_XAPI_DATABASE: "openedx"`, and still got this error.
Can you try with `alembic downgrade base` and then `alembic upgrade head`?
Note that @Ian2012's suggestion will destroy all of your xAPI data; you'll either need to recreate new logs or replay the tracking log. I was definitely able to deploy with Vector last week without rebuilding everything, I just changed the vector database to xapi instead of the other way around. I haven't done a fresh build just for Vector recently, though. I'll put that on the list to try soon.