Add support for continuously starting load jobs as slots free up in the loader #1494
base: devel
Conversation
✅ Deploy Preview for dlt-hub-docs canceled.
@@ -371,6 +397,18 @@ def complete_package(self, load_id: str, schema: Schema, aborted: bool = False)
    f"All jobs completed, archiving package {load_id} with aborted set to {aborted}"
)

def update_loadpackage_info(self, load_id: str) -> None:
I am not quite sure when this needs to be run and how often, so any help there would be appreciated.
if (
    len(self.load_storage.list_new_jobs(load_id)) == 0
    and len(self.load_storage.normalized_packages.list_started_jobs(load_id)) == 0
):
is this the correct "package completion" condition? I think so, but am not 100% sure.
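If it helps readability, the check could be pulled into a small predicate. A minimal sketch, reusing the two storage calls from the diff above (the method name is hypothetical):

```python
def _package_jobs_exhausted(self, load_id: str) -> bool:
    # a package looks complete when nothing is queued ("new" jobs)
    # and nothing is still in flight ("started" jobs)
    return (
        len(self.load_storage.list_new_jobs(load_id)) == 0
        and len(self.load_storage.normalized_packages.list_started_jobs(load_id)) == 0
    )
```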
tests/load/test_dummy_client.py
@@ -96,15 +96,15 @@ def test_unsupported_write_disposition() -> None:
    load.load_storage.normalized_packages.save_schema(load_id, schema)
I needed to change a bunch of tests, since we no longer rely on multiple executions of the run method. All the changes make sense; it might be good to add a few more test cases specific to the new implementation, along the lines of the sketch below.
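A hedged sketch of one such test; the helpers (setup_loader, prepare_load_package, NORMALIZED_FILES) are assumed to come from this test module and their signatures may differ:

```python
def test_single_run_completes_package() -> None:
    # with continuous spooling, a single run() call should drain the whole
    # package instead of requiring one runner invocation per batch of jobs
    load = setup_loader()  # assumed helper from this module
    load_id, schema = prepare_load_package(load.load_storage, NORMALIZED_FILES)
    load.run(None)
    # the package left the "normalized" state, i.e. all jobs completed in one pass
    assert not load.load_storage.list_normalized_packages()
```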
# we continuously spool new jobs and complete finished ones
running_jobs, pending_exception = self.complete_jobs(load_id, running_jobs, schema)
# do not spool new jobs if there was a signal
if not signals.signal_received() and not pending_exception:
signal support still here :)
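For context, the loop shape around the lines above, as I read it (a paraphrase, not the exact diff; start_new_jobs is an assumed name):

```python
while True:
    # reap finished jobs first; complete_jobs returns the still-running ones
    running_jobs, pending_exception = self.complete_jobs(load_id, running_jobs, schema)
    # do not spool replacements after a signal or once a job has failed
    if not signals.signal_received() and not pending_exception:
        running_jobs += self.start_new_jobs(load_id, schema, running_jobs)  # assumed name
    if not running_jobs:
        break  # drained: either everything completed or spooling was stopped
    sleep(1)  # brief poll interval before checking job states again
```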
remaining_jobs: List[LoadJob] = []
# if an exception condition was met, return it to the main runner
pending_exception: Optional[Exception] = None
I now need to collect the exception here and raise it in the main loop; alternatively, we could collect all the problems we find and report them, instead of raising on the first exception. A sketch of the current pattern follows.
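A minimal, self-contained sketch of that collection pattern (the job.state()/job.exception() accessors are assumptions, not dlt's actual LoadJob API):

```python
from typing import List, Optional, Tuple

def complete_jobs(jobs: List["LoadJob"]) -> Tuple[List["LoadJob"], Optional[Exception]]:
    # remember only the first failure; the main loop decides when to raise it,
    # which also leaves room to collect *all* problems and report them instead
    remaining: List["LoadJob"] = []
    pending_exception: Optional[Exception] = None
    for job in jobs:
        state = job.state()  # assumed accessor
        if state == "failed":
            pending_exception = pending_exception or job.exception()  # assumed accessor
        elif state != "completed":
            remaining.append(job)  # still running; check again next iteration
    return remaining, pending_exception
```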
Note: PR on hold until the naming-conventions PR is merged.
assumes followup jobs can always be created without error
mark load job vars private
Description
In the current implementation we essentially always start n (= max workers) load jobs, let all of them complete, and then rerun the whole loader to schedule the next n jobs. In this PR we submit new load jobs as soon as slots free up; a sketch of the pattern follows.
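To make the difference concrete, here is a minimal, self-contained sketch of the "refill as slots free up" pattern using plain concurrent.futures; this is illustrative, not the actual loader code:

```python
from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait
from typing import Callable, List, Set

MAX_WORKERS = 20  # corresponds to n (= max workers) above

def run_all(job_ids: List[str], execute_job: Callable[[str], None]) -> None:
    pending = list(job_ids)
    in_flight: Set[Future] = set()
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        while pending or in_flight:
            # old behavior: submit a batch of n and wait for *all* of them;
            # new behavior: top up whenever at least one slot frees up
            while pending and len(in_flight) < MAX_WORKERS:
                in_flight.add(pool.submit(execute_job, pending.pop(0)))
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # surface job failures in the main loop
```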
What is happening here:
TODO: