
Add support for continuously starting load jobs as slots free up in the loader #1494

Open

sh-rp wants to merge 37 commits into devel from feat/continuous-load-jobs
Conversation

@sh-rp (Collaborator) commented Jun 19, 2024

Description

In the current implementation we more or less always start n (= max workers) load jobs, wait for them to complete, and then rerun the whole loader to schedule the next n jobs. With this PR we submit new load jobs as slots free up.

What is happening here:

  • The loader now periodically checks whether jobs are done and schedules new ones as slots free up (see the sketch after this list).
  • Runnable jobs now manage their own internal state (file moving still needs to be done by the loader on the main thread) and have a dedicated "run" method, which is what gets called on a thread.
  • The Job base classes were renamed to make it clearer what is going on:
    • RunnableLoadJob: Jobs that actually do something and should be executed on a thread
    • FollowupJob: Class that creates a new job persisted to disk
    • HasFollowupJobs (ex-FollowupJob): Trait that tells the loader to look for followup jobs
    • FinalizedJob: Not runnable because it already has an actionable state (completed, failed, retry). Used for failed restored jobs, completed restored jobs, and cases where nothing needs to be done
  • FollowupJobs always go to the "new_jobs" folder and need to be picked up by the loader like any other job; previously some were executed directly in the main thread, which was not good. :) We now assume that creating a FollowupJob does not raise exceptions; I don't think handling that is needed, and it makes the code simpler.
  • A (hopefully) simpler load class
  • Restoring jobs is now handled the same way as creating jobs, which simplifies the code in a few places.
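
A minimal sketch of the idea, not the actual Load class (the RunnableLoadJob stub, run_package, and the poll interval below are made up for illustration):

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Tuple


class RunnableLoadJob:
    """Stand-in for a job that does actual work; run() is executed on a worker thread."""

    def __init__(self, file_name: str) -> None:
        self.file_name = file_name

    def run(self) -> None:
        time.sleep(0.1)  # placeholder for copying a file into the destination


def run_package(new_job_files: List[str], max_workers: int = 20) -> None:
    running: List[Tuple[RunnableLoadJob, Future]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while new_job_files or running:
            # complete finished jobs first so their slots free up
            still_running = []
            for job, future in running:
                if future.done():
                    future.result()  # re-raises if the job's run() failed
                else:
                    still_running.append((job, future))
            running = still_running
            # immediately refill freed slots instead of waiting for the whole batch
            while new_job_files and len(running) < max_workers:
                job = RunnableLoadJob(new_job_files.pop())
                running.append((job, pool.submit(job.run)))
            time.sleep(0.1)  # poll interval


if __name__ == "__main__":
    run_package([f"file_{i}.jsonl" for i in range(50)], max_workers=5)
```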

TODO:

  • Check that all jobs that can actually be resumed because they run remotely (I think BigQuery has this) work properly with the new setup.
  • Add more tests for the dummy loader to verify that everything works as expected.
  • Fix the delta reference job.


netlify bot commented Jun 19, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: 18fbca2
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/668db6de1d9e710008c160e7

@@ -371,6 +397,18 @@ def complete_package(self, load_id: str, schema: Schema, aborted: bool = False)
f"All jobs completed, archiving package {load_id} with aborted set to {aborted}"
)

def update_loadpackage_info(self, load_id: str) -> None:
@sh-rp (Collaborator, Author) commented:

I am not quite sure when this needs to be run and how often, so any help there would be appreciated.

if (
len(self.load_storage.list_new_jobs(load_id)) == 0
and len(self.load_storage.normalized_packages.list_started_jobs(load_id)) == 0
):
@sh-rp (Collaborator, Author) commented:

Is this the correct "package completion" condition? I think so, but I'm not 100% sure.

@@ -96,15 +96,15 @@ def test_unsupported_write_disposition() -> None:
load.load_storage.normalized_packages.save_schema(load_id, schema)
@sh-rp (Collaborator, Author) commented:

I needed to change a bunch of tests, since we no longer rely on multiple executions of the run method. All the changes make sense; it might be good to add a few more test cases specific to the new implementation.

@sh-rp sh-rp marked this pull request as ready for review June 19, 2024 14:57
# we continuously spool new jobs and complete finished ones
running_jobs, pending_exception = self.complete_jobs(load_id, running_jobs, schema)
# do not spool new jobs if there was a signal
if not signals.signal_received() and not pending_exception:
@sh-rp (Collaborator, Author) commented:

signal support still here :)

remaining_jobs: List[LoadJob] = []
# if an exception condition was met, return it to the main runner
pending_exception: Exception = None
@sh-rp (Collaborator, Author) commented:

I need to collect the exception here and raise it in the main loop now. Alternatively, we could collect all the problems we find and report them together, instead of raising on the first exception. A minimal sketch of the idea is below.
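
An illustrative sketch only, assuming made-up names (DummyJob, complete_jobs), not the actual loader code:

```python
from typing import List, Optional


class DummyJob:
    """Stand-in for a load job with a terminal state."""

    def __init__(self, name: str, state: str) -> None:
        self.name = name
        self._state = state

    def state(self) -> str:
        return self._state


def complete_jobs(jobs: List[DummyJob]) -> Optional[Exception]:
    """Remember the first failure as a pending exception instead of raising on the worker side."""
    pending_exception: Optional[Exception] = None
    for job in jobs:
        if job.state() == "failed" and pending_exception is None:
            pending_exception = RuntimeError(f"job {job.name} failed")
    return pending_exception


if __name__ == "__main__":
    pending = complete_jobs([DummyJob("a", "completed"), DummyJob("b", "failed")])
    if pending is not None:
        raise pending  # surfaced in the main loop, not inside a worker thread
```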

@sh-rp (Collaborator, Author) commented Jun 20, 2024

Note: this PR is on hold until the naming conventions PR is merged.

@sh-rp sh-rp added the enhancement New feature or request label Jun 20, 2024
@rudolfix rudolfix added the sprint Marks group of tasks with core team focus at this moment label Jun 26, 2024
# Conflicts:
#	dlt/load/load.py
#	tests/load/test_dummy_client.py
@sh-rp sh-rp force-pushed the feat/continuous-load-jobs branch from c265ecc to b4d05c8 Compare June 27, 2024 13:31
@sh-rp sh-rp force-pushed the feat/continuous-load-jobs branch from 0a9b5c3 to da8c9e6 Compare July 2, 2024 10:45
@rudolfix rudolfix removed the sprint Marks group of tasks with core team focus at this moment label Jul 3, 2024