Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adds scaling documentation #214

Closed
wants to merge 2 commits into from
Closed

Conversation

pomegranited
Copy link
Contributor

This change adds basic documentation for scaling resources related to Aspects.
However, as several scaling-related features still need to be fixed or implemented, it's a work in progress.

Closes #78

with placeholder document for configuring the event bus.
@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Mar 28, 2024
@openedx-webhooks
Copy link

openedx-webhooks commented Mar 28, 2024

Thanks for the pull request, @pomegranited!

What's next?

Please work through the following steps to get your changes ready for engineering review:

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.

🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads

🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

🔘 Update the status of your PR

Your PR is currently marked as a draft. After completing the steps above, update its status by clicking "Ready for Review", or removing "WIP" from the title, as appropriate.

🔘 Let us know that your PR is ready for review:

Who will review my changes?

This repository is currently maintained by @bmtcril. Tag them in a comment and let them know that your changes are ready for review.

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.


.. code-block:: yaml

TBD -- implement configurable queue for Aspects tasks
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From comment:

It would be better to move the aspects-related tasks to the high queue, so it will perform better and doesn't block other LMS tasks.

@Ian2012 @bmtcril IIRR, to do this we need to do this for both the platform-plugin-aspects event sinks and event-routing-backends tasks tasks, correct?

  • add an app setting to configure the default celery queue. Default to HIGH_PRIORITY_QUEUE.
  • pass this queue setting as to the queue parameter wherever the tasks are run.

Copy link
Contributor

@Ian2012 Ian2012 Apr 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then create a new consumer for both services (lms, cms) named aspects-{{service}}-consumer which are configured as any other lms-cms worker but that will only read that queue. However, I think the performance gains of batching are enough so that we don't need to this anymore. cc @bmtcril

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oo cool.. will await the results of your experiment, thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bmtcril @Ian2012 Should we enable batching by default, if it helps performance this much?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to modify the default behaviorin ERB and I think we can enable it with the tutor plugin


.. code-block:: yaml

TBD -- implement RALPH autoscaling
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cf #81

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ian2012 @bmtcril Should the scaling parameters for Ralph, Vector, and Superset be added to tutor-contrib-aspects, or should we keep them in https://github.com/eduNEXT/tutor-contrib-pod-autoscaling ?

Copy link
Contributor

@Ian2012 Ian2012 Apr 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values will be added via a filter in tutor-contrib-aspects. This will be once the next named release is public and this PR is released: eduNEXT/tutor-contrib-pod-autoscaling#7

Comment on lines +67 to +78
Small deployments can start with the following set up, and scale later:

* 1 Clickhouse Keeper node, see `04-replication-zookeeper-01-minimal.yaml`_
* 1 Clickhouse node, see `03-persistent-volume-01-default-volume.yaml`_

For large deployments, we recommend:

* 3 Clickhouse Keeper nodes to form the quorum: see `02-extended-3-nodes.yaml`_
* N Clickhouse nodes to perform the replication.

If your k8s provider supports resizable volumes, see `03-persistent-volume-05-resizeable-volume-2.yaml`_
Otherwise, see `03-persistent-volume-02-pod-template.yaml`_
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be added to harmony, cf #80


.. code-block:: yaml

TBD -- needs fixing
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


.. code-block:: yaml

TBD -- implement Superset autoscaling
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cf #79


By default, the `Aspects Tutor plugin`_ deploys single nodes for these services:

* :ref:`quick-start-ralph` or :ref:`quick-start-vector`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, running with Vector and not Ralph isn't working anymore:

INFO  [alembic.runtime.migration] Running upgrade 0010 -> 0011
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/clickhouse_sqlalchemy/drivers/native/connector.py", line 152, in execute
    response = execute(
               ^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 382, in execute
    rv = self.process_ordinary_query(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 580, in process_ordinary_query
    return self.receive_result(with_column_types=with_column_types,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 213, in receive_result
    return result.get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/result.py", line 50, in get_result
    for packet in self.packet_generator:
  File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 229, in packet_generator
    packet = self.receive_packet()
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 246, in receive_packet
    raise packet.exception
clickhouse_driver.errors.ServerException: Code: 122.
DB::Exception: Tables have different structure.

I tried deploying with and without setting ASPECTS_XAPI_DATABASE: "openedx", and still got this error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try with alembic downgrade base and then alembic upgrade head?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that @Ian2012 's suggestion will destroy all of your xAPI data, you'll either need to recreate new logs or replay the tracking log. I was definitely able to deploy with Vector last week without rebuilding everything, I just changed the vector database to xapi instead of the other way around. I haven't done a fresh build just for Vector recently, though. I'll put that on the list to try soon.

Scaling Clickhouse
==================

We recommend using a Clickhouse service provider to manage your production cluster.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ian2012 @bmtcril I've been playing with deploying aspects using our Grove deployment system, and it was trivial to get it running with the default single-node Clickhouse deployment and persistent volume: https://superset.jill-aspects.staging.do.opencraft.hosting

I haven't tried using Harmony yet.

But if we're recommending using a hosted CH in production, do we need to add support for a scalable CH deployment to Harmony for Aspects v1? I'm happy to try, but I'm worried about maintaining something if no one is using it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my opinion: I would add support for the ClickHouse operator and a multi-node ClickHouse installation with the needed configuration for Aspects to run. But let people use the operator as they need.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, a lot of folks prefer to run their own for cost or data privacy reasons so we should support scaling for it. I think we should be pretty up front about complicated scaling configurations being at your own risk since there's no way we can support them all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, that can be documented as part of #80.


And all production deployments will need a `persistent Clickhouse cluster`_.

Preparing the LMS workers
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: add info about configuring ERB batching

@bmtcril
Copy link
Contributor

bmtcril commented Apr 2, 2024

This is a great start, thanks so much for doing it. I spent last week load testing these pipelines and have some new numbers and opinions about when each is a good solution as well as what knobs to turn for improving performance in each one. I'm going to be presenting on that at data wg tomorrow, but will try to have the presentation ready to share tonight.

@mphilbrick211
Copy link

@pomegranited just checking in on this to see if it's still in progress?

@pomegranited
Copy link
Contributor Author

@mphilbrick211 Yep sorry.. I'll wrap this up next sprint.

@pomegranited
Copy link
Contributor Author

Hi @mphilbrick211 , I wasn't able to get to this task this sprint, so closing this PR for now. I'll re-open when ready.

@openedx-webhooks
Copy link

@pomegranited Even though your pull request wasn’t merged, please take a moment to answer a two question survey so we can improve your experience in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
open-source-contribution PR author is not from Axim or 2U
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Docs: How-to - Scaling
5 participants