Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

creates a single source in extract for all resource instances passed as list #1535

Open
wants to merge 1 commit into
base: devel
Choose a base branch
from

Conversation

rudolfix
Copy link
Collaborator

@rudolfix rudolfix commented Jul 2, 2024

Description

We discovered peculiar problem with rest_api when users were passing a list of resources to the run function that contained a resource and a transformer:

     page = 1

    @dlt.resource(name="pages")
    def gen_pages():
        nonlocal page
        while True:
            yield {"page": page}
            if page == 10:
                return
            page += 1

    @dlt.transformer(name="subpages")
    def get_subpages(page_item):
        yield from [
            {
                "page": page_item["page"],
                "subpage": subpage,
            }
            for subpage in range(1, 11)
        ]

    pipeline = dlt.pipeline("test_resource_transformer_standalone", destination="duckdb")
    # here we must combine resources and transformers using the same instance
    info = pipeline.run([gen_pages, gen_pages | get_subpages])

in the case above only last page is passed to the transformer (see the commits for tests and details)
the root cause is that each resource in the list was packaged in a separate source and extracted separately. that prevented any DAG optimizations and gen_pages was extracted twice.

here we change the behavior where a single dlt source is used to extract all the resources in the list

@rudolfix rudolfix self-assigned this Jul 2, 2024
Copy link

netlify bot commented Jul 2, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit d77cb1a
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/66843bf04aaa4e00089f41bf
😎 Deploy Preview https://deploy-preview-1535--dlt-hub-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@rudolfix rudolfix requested a review from burnash July 3, 2024 08:49
@rudolfix rudolfix added the sprint Marks group of tasks with core team focus at this moment label Jul 3, 2024
Copy link
Collaborator

@sh-rp sh-rp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, i was going to suggest doing this for another reason anyway: parallelism will not work on a list of resources in the current version, now it will :)

PS: Maybe a small note about this behavior is warranted in the docs, I'm not sure.

@rudolfix
Copy link
Collaborator Author

rudolfix commented Jul 3, 2024

@sh-rp I'll add this to our release notes as one of the changes

@rudolfix
Copy link
Collaborator Author

rudolfix commented Jul 3, 2024

@sh-rp also you are partially right with the parallelism! we have tests that are passing a list of many resources with the same names. and those tests are failing. We'd need to package them in separate sources and execute them one by one

@rudolfix rudolfix removed the sprint Marks group of tasks with core team focus at this moment label Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

None yet

2 participants