
Add support for continuously starting load jobs as slots free up in the loader #1494

Open

sh-rp wants to merge 37 commits into devel from feat/continuous-load-jobs
Conversation

@sh-rp (Collaborator) commented Jun 19, 2024

Description

In the current implementation we more or less always start n (= max workers) load jobs, wait for them to complete, and then rerun the whole loader to schedule the next n jobs. With this PR we submit new load jobs as slots free up.

What is happening here:

  • The loader now periodically checks whether jobs are done and schedules new ones as slots free up (see the sketch after this list).
  • Runnable jobs now manage their own internal state (file moving still needs to be done by the loader on the main thread) and have a dedicated "run" method, which is what gets called on a thread.
  • The Job base classes were renamed to make it clearer what is going on:
    • RunnableLoadJob: Jobs that actually do something and should be executed on a thread
    • FollowupJob: Class that creates a new job persisted to disk
    • HasFollowupJobs (ex-FollowupJob): Trait that tells the loader to look for followup jobs
    • FinalizedJob: Not runnable because it already has an actionable state (completed, failed, retry). Used for failed restored jobs, completed restored jobs, and cases where nothing needs to be done
  • FollowupJobs always go to the "new_jobs" folder and need to be picked up by the loader like any other job; previously some were executed directly in the main thread, which was not good. :) We now assume that creating a FollowupJob does not raise exceptions; I don't think handling that is needed, and it makes the code simpler.
  • A (hopefully) simpler load class
  • Restoring jobs is now handled the same way as creating jobs, which simplifies the code in a few places.
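
A minimal sketch of the idea, not the actual Load class (the RunnableLoadJob stub, run_package, and the poll interval below are made up for illustration):

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Tuple


class RunnableLoadJob:
    """Stand-in for a job that does actual work; run() is executed on a worker thread."""

    def __init__(self, file_name: str) -> None:
        self.file_name = file_name

    def run(self) -> None:
        time.sleep(0.1)  # placeholder for copying a file into the destination


def run_package(new_job_files: List[str], max_workers: int = 20) -> None:
    running: List[Tuple[RunnableLoadJob, Future]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while new_job_files or running:
            # complete finished jobs first so their slots free up
            still_running = []
            for job, future in running:
                if future.done():
                    future.result()  # re-raises if the job's run() failed
                else:
                    still_running.append((job, future))
            running = still_running
            # immediately refill freed slots instead of waiting for the whole batch
            while new_job_files and len(running) < max_workers:
                job = RunnableLoadJob(new_job_files.pop())
                running.append((job, pool.submit(job.run)))
            time.sleep(0.1)  # poll interval


if __name__ == "__main__":
    run_package([f"file_{i}.jsonl" for i in range(50)], max_workers=5)
```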

TODO:

  • Check that all jobs that can actually be resumed because they run remotely (I think BigQuery has this) work properly with the new setup.
  • Add more tests for the dummy loader to verify that everything works as expected.
  • Fix the delta reference job.


netlify bot commented Jun 19, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: 18fbca2
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/668db6de1d9e710008c160e7

@@ -371,6 +397,18 @@ def complete_package(self, load_id: str, schema: Schema, aborted: bool = False)
f"All jobs completed, archiving package {load_id} with aborted set to {aborted}"
)

def update_loadpackage_info(self, load_id: str) -> None:
@sh-rp (Collaborator, Author) commented:

I am not quite sure when this needs to be run and how often, so any help there would be appreciated.

if (
len(self.load_storage.list_new_jobs(load_id)) == 0
and len(self.load_storage.normalized_packages.list_started_jobs(load_id)) == 0
):
@sh-rp (Collaborator, Author) commented:

Is this the correct "package completion" condition? I think so, but I'm not 100% sure.

@@ -96,15 +96,15 @@ def test_unsupported_write_disposition() -> None:
load.load_storage.normalized_packages.save_schema(load_id, schema)
@sh-rp (Collaborator, Author) commented:

I needed to change a bunch of tests, since we no longer rely on multiple executions of the run method. All the changes make sense; it might be good to add a few more test cases specific to the new implementation.

@sh-rp sh-rp marked this pull request as ready for review June 19, 2024 14:57
# we continuously spool new jobs and complete finished ones
running_jobs, pending_exception = self.complete_jobs(load_id, running_jobs, schema)
# do not spool new jobs if there was a signal
if not signals.signal_received() and not pending_exception:
@sh-rp (Collaborator, Author) commented:

signal support still here :)

remaining_jobs: List[LoadJob] = []
# if an exception condition was met, return it to the main runner
pending_exception: Exception = None
@sh-rp (Collaborator, Author) commented:

I need to collect the exception here and raise it in the main loop now. Alternatively, we could collect all the problems we find and report them together, instead of raising on the first exception. A minimal sketch of the idea is below.
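
An illustrative sketch only, assuming made-up names (DummyJob, complete_jobs), not the actual loader code:

```python
from typing import List, Optional


class DummyJob:
    """Stand-in for a load job with a terminal state."""

    def __init__(self, name: str, state: str) -> None:
        self.name = name
        self._state = state

    def state(self) -> str:
        return self._state


def complete_jobs(jobs: List[DummyJob]) -> Optional[Exception]:
    """Remember the first failure as a pending exception instead of raising on the worker side."""
    pending_exception: Optional[Exception] = None
    for job in jobs:
        if job.state() == "failed" and pending_exception is None:
            pending_exception = RuntimeError(f"job {job.name} failed")
    return pending_exception


if __name__ == "__main__":
    pending = complete_jobs([DummyJob("a", "completed"), DummyJob("b", "failed")])
    if pending is not None:
        raise pending  # surfaced in the main loop, not inside a worker thread
```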

@sh-rp (Collaborator, Author) commented Jun 20, 2024

Note: this PR is on hold until the naming conventions PR is merged.

@sh-rp sh-rp added the enhancement New feature or request label Jun 20, 2024
@rudolfix rudolfix added the sprint Marks group of tasks with core team focus at this moment label Jun 26, 2024
# Conflicts:
#	dlt/load/load.py
#	tests/load/test_dummy_client.py
@sh-rp sh-rp force-pushed the feat/continuous-load-jobs branch from c265ecc to b4d05c8 Compare June 27, 2024 13:31
@sh-rp sh-rp force-pushed the feat/continuous-load-jobs branch from 0a9b5c3 to da8c9e6 Compare July 2, 2024 10:45
@rudolfix rudolfix removed the sprint Marks group of tasks with core team focus at this moment label Jul 3, 2024