feat: adds scaling documentation #214

Closed
wants to merge 2 commits
2 changes: 1 addition & 1 deletion docs/how-tos/clickhouse_cluster.rst
@@ -1,4 +1,4 @@
.. clickhouse-cluster:
.. _clickhouse-cluster:

How To Run Aspects With ClickHouse Cluster
******************************************
6 changes: 6 additions & 0 deletions docs/how-tos/event_bus.rst
@@ -0,0 +1,6 @@
.. _event_bus:

Running the event bus
*********************

WIP -- https://github.com/openedx/tutor-contrib-aspects/pull/610
2 changes: 2 additions & 0 deletions docs/how-tos/index.rst
@@ -5,6 +5,8 @@ How-Tos
   :maxdepth: 2
   :caption: Content:

   Scaling Aspects <scaling>
   Running the event bus <event_bus>
   Upgrade Aspects <upgrade>
   Changing the xAPI actor identifier <changing_actor_identifier>
   Backfill old or missing data <backfill>
143 changes: 143 additions & 0 deletions docs/how-tos/scaling.rst
@@ -0,0 +1,143 @@
.. _scaling:

Scaling your deployment
***********************

By default, the `Aspects Tutor plugin`_ deploys single nodes for these services:

* :ref:`quick-start-ralph` or :ref:`quick-start-vector`
Contributor Author:

Unfortunately, running with Vector and not Ralph isn't working anymore:

INFO  [alembic.runtime.migration] Running upgrade 0010 -> 0011
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/clickhouse_sqlalchemy/drivers/native/connector.py", line 152, in execute
    response = execute(
               ^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 382, in execute
    rv = self.process_ordinary_query(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 580, in process_ordinary_query
    return self.receive_result(with_column_types=with_column_types,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 213, in receive_result
    return result.get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/result.py", line 50, in get_result
    for packet in self.packet_generator:
  File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 229, in packet_generator
    packet = self.receive_packet()
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 246, in receive_packet
    raise packet.exception
clickhouse_driver.errors.ServerException: Code: 122.
DB::Exception: Tables have different structure.

I tried deploying with and without setting ASPECTS_XAPI_DATABASE: "openedx", and still got this error.

Contributor:

Can you try with alembic downgrade base and then alembic upgrade head?

Contributor:

Note that @Ian2012's suggestion will destroy all of your xAPI data; you'll either need to generate new logs or replay the tracking logs. I was definitely able to deploy with Vector last week without rebuilding everything; I just changed the vector database to xapi instead of the other way around. I haven't done a fresh build just for Vector recently, though. I'll put that on the list to try soon.

* Clickhouse: ephemeral single node cluster
* Superset: shares Open edX's MySQL and Redis instances

Most deployments will benefit greatly from scaling horizontally and/or vertically, especially when running Aspects. Aspects is fed by Open edX event data, which is triggered by user actions on the site, and Open edX can generate a lot of event data. The initial processing of this event data happens in platform plugins, so scaling your LMS workers shortens processing time. The faster events are processed, the sooner they appear in Aspects' dashboards and charts.

All production deployments will also need a `persistent Clickhouse cluster`_.

Preparing the LMS workers
=========================

Contributor Author:

TODO: add info about configuring ERB batching

Before deploying Aspects, we recommend setting up the event bus; see :ref:`event_bus`.

Or, if you'd rather use celery, we recommend increasing the number of LMS workers to prepare the celery queue for the high volume of Aspects tasks.

These plugins can be used to configure and enable autoscaling for the LMS and CMS. See their READMEs for details:

* `tutor-contrib-pod-autoscaling`_: for single-instance deployments
* `tutor-contrib-grove`_: for multi-instance deployments

We also recommend configuring the LMS to use the high-priority celery queue for Aspects tasks (`platform-plugin-aspects event sinks`_, `event-routing-backends xAPI tasks`_). This leaves the low-priority queue clear for other LMS tasks.

.. code-block:: yaml

   TBD -- implement configurable queue for Aspects tasks
Contributor Author:

From comment:

It would be better to move the aspects-related tasks to the high queue, so they perform better and don't block other LMS tasks.

@Ian2012 @bmtcril IIRC, to do this we need to do it for both the platform-plugin-aspects event sinks and the event-routing-backends tasks, correct?

  • add an app setting to configure the default celery queue, defaulting to HIGH_PRIORITY_QUEUE.
  • pass this queue setting to the queue parameter wherever the tasks are run.

@Ian2012 (Contributor), Apr 1, 2024:

And then create a new consumer for both services (lms, cms) named aspects-{{service}}-consumer, configured like any other LMS/CMS worker but reading only that queue. However, I think the performance gains of batching are enough that we don't need to do this anymore. cc @bmtcril

Contributor Author:

Oo cool.. will await the results of your experiment, thank you!

Contributor Author:

@bmtcril @Ian2012 Should we enable batching by default, if it helps performance this much?

Contributor:

I didn't want to modify the default behavior in ERB, and I think we can enable it with the tutor plugin.


Scaling Ralph
=============

Without scaling, Ralph can be a bottleneck in the Aspects data pipeline. If you do not need an LRS at all, you can disable Ralph and use :ref:`quick-start-vector` instead.

But if you will ever need an LRS, it's better to start with Ralph enabled and autoscaled.

Ralph runs CPU-intensive operations, so we recommend scaling mainly on CPU:

.. code-block:: yaml

   TBD -- implement RALPH autoscaling
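
Until the plugin exposes these values, a plain Kubernetes HorizontalPodAutoscaler gives a rough idea of the intended shape. This is a sketch only: the deployment name, namespace, replica counts, and CPU threshold below are assumptions, not tested values.

.. code-block:: yaml

   # Hypothetical sketch: CPU-based autoscaling for the Ralph deployment.
   # Names, namespace, and thresholds are assumptions.
   apiVersion: autoscaling/v2
   kind: HorizontalPodAutoscaler
   metadata:
     name: ralph
     namespace: openedx
   spec:
     scaleTargetRef:
       apiVersion: apps/v1
       kind: Deployment
       name: ralph
     minReplicas: 2
     maxReplicas: 10
     metrics:
       - type: Resource
         resource:
           name: cpu
           target:
             type: Utilization
             averageUtilization: 70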
Contributor Author:

cf #81

Contributor Author:

@Ian2012 @bmtcril Should the scaling parameters for Ralph, Vector, and Superset be added to tutor-contrib-aspects, or should we keep them in https://github.com/eduNEXT/tutor-contrib-pod-autoscaling ?

@Ian2012 (Contributor), Apr 1, 2024:

The values will be added via a filter in tutor-contrib-aspects, once the next named release is public and this PR is released: eduNEXT/tutor-contrib-pod-autoscaling#7


.. _scaling-clickhouse:

Scaling Clickhouse
==================

We recommend using a Clickhouse service provider to manage your production cluster.
Contributor Author:

@Ian2012 @bmtcril I've been playing with deploying Aspects using our Grove deployment system, and it was trivial to get it running with the default single-node Clickhouse deployment and persistent volume: https://superset.jill-aspects.staging.do.opencraft.hosting

I haven't tried using Harmony yet.

But if we're recommending using a hosted CH in production, do we need to add support for a scalable CH deployment to Harmony for Aspects v1? I'm happy to try, but I'm worried about maintaining something if no one is using it.

Contributor:

This is my opinion: I would add support for the ClickHouse operator and a multi-node ClickHouse installation with the needed configuration for Aspects to run. But let people use the operator as they need.

Contributor:

Yeah, a lot of folks prefer to run their own for cost or data privacy reasons so we should support scaling for it. I think we should be pretty up front about complicated scaling configurations being at your own risk since there's no way we can support them all.

Contributor Author:

Cool, that can be documented as part of #80.


Aspects `avoids using experimental Clickhouse features`_, and so is suitable for use with cloud providers. Cloud hosting also provides support, automated backups, and autoscaling. See :ref:`clickhouse-cluster` for details.

However, if you decide to run your own Clickhouse instance, you will need to take into account:

* horizontal and vertical scaling
* replication and quorum
* data storage requirements over time

References:

* `Clickhouse Operator`_: Helm charts, docs and examples from Altinity
* `Clickhouse Keeper`_: recommended replication setup

Small deployments can start with the following setup, and scale later:

* 1 Clickhouse Keeper node, see `04-replication-zookeeper-01-minimal.yaml`_
* 1 Clickhouse node, see `03-persistent-volume-01-default-volume.yaml`_

For large deployments, we recommend:

* 3 Clickhouse Keeper nodes to form the quorum: see `02-extended-3-nodes.yaml`_
* N Clickhouse nodes to perform the replication.

If your k8s provider supports resizable volumes, see `03-persistent-volume-05-resizeable-volume-2.yaml`_.
Otherwise, see `03-persistent-volume-02-pod-template.yaml`_.
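
As an illustration of the small setup above, a minimal Altinity ``ClickHouseInstallation`` might look like the sketch below. The names, namespace, Keeper host, and storage size are assumptions; treat the linked operator examples as the authoritative manifests.

.. code-block:: yaml

   # Hypothetical sketch: one ClickHouse replica with a persistent volume,
   # pointed at a single Keeper node. All names and sizes are assumptions.
   apiVersion: clickhouse.altinity.com/v1
   kind: ClickHouseInstallation
   metadata:
     name: aspects
   spec:
     configuration:
       zookeeper:
         nodes:
           - host: clickhouse-keeper  # assumed Keeper service name
       clusters:
         - name: aspects
           layout:
             shardsCount: 1
             replicasCount: 1
     defaults:
       templates:
         dataVolumeClaimTemplate: data-volume
     templates:
       volumeClaimTemplates:
         - name: data-volume
           spec:
             accessModes:
               - ReadWriteOnce
             resources:
               requests:
                 storage: 100Gi

To scale later, increase ``replicasCount`` (with 3 Keeper nodes for quorum) rather than starting over.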
Contributor Author (on lines +67 to +78):

Needs to be added to harmony, cf #80


Scaling Superset
================

By default, Aspects configures Superset to share these resources with Open edX:

* mysql
* redis

However, if Superset becomes too resource-intensive, you can give it standalone MySQL and Redis services instead.

.. code-block:: yaml

   TBD -- needs fixing
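
As an illustration only, pointing Superset at its own MySQL and Redis from Tutor config might look roughly like the following. Every setting name below is hypothetical; check the Aspects plugin's actual configuration before using anything like this.

.. code-block:: yaml

   # Illustrative only: hypothetical Tutor settings for giving Superset
   # standalone MySQL and Redis rather than sharing Open edX's instances.
   # The setting names here are placeholders, not the plugin's real names.
   SUPERSET_DB_HOST: mysql-superset.internal
   SUPERSET_DB_PORT: 3306
   SUPERSET_DB_NAME: superset
   SUPERSET_REDIS_HOST: redis-superset.internal
   SUPERSET_REDIS_PORT: 6379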



.. note::

   Ensure ``sql_require_primary_key = off`` for your MySQL server.

   Some of the Superset tables are initially created without primary keys, so if this flag is set, these migrations fail.

   Some hosting providers `like DigitalOcean`_ enable this flag by default.


Superset should also be configured to autoscale based on CPU and RAM, using a configuration similar to your CMS's:

.. code-block:: yaml

   TBD -- implement Superset autoscaling
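
As a sketch of what that autoscaling could look like, a HorizontalPodAutoscaler targeting both CPU and memory is shown below. The deployment name, namespace, and utilization targets are assumptions, not tested values.

.. code-block:: yaml

   # Hypothetical sketch: autoscale Superset on both CPU and memory.
   # Names and thresholds are assumptions.
   apiVersion: autoscaling/v2
   kind: HorizontalPodAutoscaler
   metadata:
     name: superset
     namespace: openedx
   spec:
     scaleTargetRef:
       apiVersion: apps/v1
       kind: Deployment
       name: superset
     minReplicas: 2
     maxReplicas: 8
     metrics:
       - type: Resource
         resource:
           name: cpu
           target:
             type: Utilization
             averageUtilization: 75
       - type: Resource
         resource:
           name: memory
           target:
             type: Utilization
             averageUtilization: 80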
Contributor Author:

cf #79



Superset also offers these scaling features, which future versions of Aspects may take advantage of:

* `asynchronous queries`_: configure the database assets to enable "Asynchronous query execution mode", which moves query execution to the celery workers.
  This is useful for queries that run beyond a typical web request's timeout (30-60 seconds).
* cache warming: schedule tasks to use the `Superset API`_ to pre-fetch data into the caches.
  This is useful for frequently-accessed datasets or charts.
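
Pending first-class support, cache warming can be approximated with a Kubernetes CronJob that periodically requests chart data through the Superset API. Everything here is illustrative: the host, chart id, schedule, and token secret are placeholders.

.. code-block:: yaml

   # Hypothetical sketch: request a frequently-viewed chart's data every
   # 15 minutes so the result lands in Superset's cache. The host, chart
   # id, and token secret are placeholders.
   apiVersion: batch/v1
   kind: CronJob
   metadata:
     name: superset-cache-warm
   spec:
     schedule: "*/15 * * * *"
     jobTemplate:
       spec:
         template:
           spec:
             restartPolicy: OnFailure
             containers:
               - name: warm
                 image: curlimages/curl:8.8.0
                 env:
                   - name: TOKEN
                     valueFrom:
                       secretKeyRef:
                         name: superset-api  # placeholder secret
                         key: token
                 args:
                   - "-s"
                   - "-H"
                   - "Authorization: Bearer $(TOKEN)"
                   - "https://superset.example.com/api/v1/chart/42/data/"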

References:

* https://www.restack.io/docs/superset-on-kubernetes
* https://medium.com/airbnb-engineering/supercharging-apache-superset-b1a2393278bd
* https://preset.io/blog/2020-08-11-nielsen-superset/
* https://flask.palletsprojects.com/en/1.1.x/becomingbig/


.. _Aspects Tutor plugin: https://github.com/openedx/tutor-contrib-aspects
.. _tutor-contrib-pod-autoscaling: https://github.com/eduNEXT/tutor-contrib-pod-autoscaling
.. _tutor-contrib-grove: https://gitlab.com/opencraft/dev/tutor-contrib-grove
.. _platform-plugin-aspects event sinks: https://github.com/openedx/platform-plugin-aspects/blob/main/platform_plugin_aspects/tasks.py
.. _event-routing-backends xAPI tasks: https://github.com/openedx/event-routing-backends/blob/master/event_routing_backends/tasks.py
.. _persistent Clickhouse cluster: #scaling-clickhouse
.. _Clickhouse cloud: https://clickhouse.com/cloud
.. _avoids using experimental Clickhouse features: ../decisions/0013_clickhouse_experimental.html
.. _Clickhouse Operator: https://github.com/Altinity/clickhouse-operator
.. _Clickhouse Keeper: https://github.com/Altinity/clickhouse-operator/blob/master/docs/zookeeper_setup.md
.. _04-replication-zookeeper-01-minimal.yaml: https://github.com/Altinity/clickhouse-operator/blob/master/docs/chi-examples/04-replication-zookeeper-01-minimal.yaml
.. _03-persistent-volume-01-default-volume.yaml: https://github.com/Altinity/clickhouse-operator/blob/master/docs/chi-examples/03-persistent-volume-01-default-volume.yaml
.. _02-extended-3-nodes.yaml: https://github.com/Altinity/clickhouse-operator/blob/master/docs/chk-examples/02-extended-3-nodes.yaml
.. _03-persistent-volume-05-resizeable-volume-2.yaml: https://github.com/Altinity/clickhouse-operator/blob/master/docs/chi-examples/03-persistent-volume-05-resizeable-volume-2.yaml
.. _03-persistent-volume-02-pod-template.yaml: https://github.com/Altinity/clickhouse-operator/blob/master/docs/chi-examples/03-persistent-volume-02-pod-template.yaml
.. _like DigitalOcean: https://www.digitalocean.com/community/questions/how-to-disable-sql_require_primary_key-in-digital-ocean-manged-database-for-mysql
.. _asynchronous queries: https://superset.apache.org/docs/installation/async-queries-celery/
.. _Superset API: https://superset.apache.org/docs/api/