
Add upsert merge strategy #1466

Open
wants to merge 28 commits into base: devel
Conversation


@jorritsandbrink jorritsandbrink commented Jun 14, 2024

Description

  • adds a new merge strategy called upsert
  • requires primary_key, which should be unique (see note 1)
    • stores primary-key hash as _dlt_id
    • does not support primary key-based deduplication like the delete-insert strategy
  • does not support merge_key
  • only tested for the postgres and snowflake destinations (other destinations will come in a follow-up PR)

Note 1:

A primary key-based _dlt_id requires that the primary key is unique before going into the normalize step. Primary key-based deduplication in the load step, as we do in the delete-insert merge strategy, is not possible.

Problems when primary key is not unique (i.e. it has duplicates):

  • database error because UNIQUE constraint on _dlt_id column is violated (can be solved by not imposing the constraint)
  • foreign keys in child staging tables link to multiple records in root staging table ➜ can't deduplicate child tables
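The core mechanism the note describes (a deterministic `_dlt_id` derived from the primary key) can be sketched as below. This is an illustrative stand-in, not dlt's actual implementation; the helper name, separator, and id length are assumptions. It also shows why duplicates break the scheme: two rows with the same key collapse onto the same `_dlt_id`.

```python
import base64
import hashlib

def key_hash(row: dict, primary_key: list) -> str:
    """Hypothetical sketch: deterministic row id from primary key values.

    Because the id is a pure function of the key, two rows sharing a key
    collide on _dlt_id -- which is why a UNIQUE constraint on _dlt_id is
    violated when the input contains duplicates.
    """
    # join key values with a separator unlikely to occur in the data
    payload = "\x01".join(str(row[k]) for k in primary_key)
    digest = hashlib.sha256(payload.encode("utf-8")).digest()
    # shorten to a compact text id (length chosen arbitrarily here)
    return base64.urlsafe_b64encode(digest[:10]).decode("ascii").rstrip("=")

row_a = {"id": 1, "name": "a"}
row_b = {"id": 1, "name": "b"}  # duplicate primary key, different payload
# same key => same _dlt_id, regardless of the rest of the row
assert key_hash(row_a, ["id"]) == key_hash(row_b, ["id"])
```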

Related Issues

@jorritsandbrink jorritsandbrink self-assigned this Jun 14, 2024
@jorritsandbrink jorritsandbrink linked an issue Jun 14, 2024 that may be closed by this pull request

netlify bot commented Jun 14, 2024

Deploy Preview for dlt-hub-docs ready!

Latest commit: e5e68f2
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/668d6a7cf4040f0008b4a4b5
Deploy Preview: https://deploy-preview-1466--dlt-hub-docs.netlify.app

```python
    " dlt will fall back to `append` for this table."
)
elif table.get("x-merge-strategy") == "upsert":
    if self.config.destination_name not in ("postgres", "snowflake"):
```
Collaborator (Author):
Should we add a supported_merge_strategies destination capability?

Collaborator:

yes - if you at some point (as early as possible ie. when passing data to normalizer when we must have destination capabilities) are able to issue a warning and say which strategy will be used instead.

same thing with replace strategies (we have 3 afaik)

also if you want to do it: we do not even store which write dispositions are supported. so maybe this is a separate ticket :)

Collaborator (Author):

  • added supported_merge_strategies capability and configured it for all SQL destinations
  • added _verify_destination_capabilities that checks configured merge strategy against supported merge strategies
    • raises error if not supported
      • (I know you suggested falling back to a supported strategy and issuing a warning instead, but I'm not a fan of that approach. Will of course change it if that's how we do things in dlt, but wanted to challenge it first.)
    • called right before normalize step — is that the right place?

I can create a separate ticket for supported replace strategies and supported write dispositions capabilities.
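The capability check described above can be sketched roughly as follows. The class and function names (`DestinationCapabilities`, `supported_merge_strategies`, `verify_merge_strategy`) are assumptions modeled on the discussion, not dlt's actual API; the sketch raises rather than falling back, per the comment above.

```python
# Hypothetical sketch of the supported_merge_strategies capability check.
class UnsupportedMergeStrategy(Exception):
    pass

class DestinationCapabilities:
    def __init__(self, supported_merge_strategies):
        self.supported_merge_strategies = supported_merge_strategies

def verify_merge_strategy(caps, strategy, table_name):
    # raise early (before normalize) instead of silently falling back
    if strategy not in caps.supported_merge_strategies:
        raise UnsupportedMergeStrategy(
            f"Table {table_name!r} is configured with merge strategy"
            f" {strategy!r}, but this destination only supports"
            f" {caps.supported_merge_strategies}"
        )

caps = DestinationCapabilities(["delete-insert", "upsert", "scd2"])
verify_merge_strategy(caps, "upsert", "my_table")  # passes silently
```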

```python
# generate statements for child tables if they exist
child_tables = table_chain[1:]
if child_tables:
    root_row_key = escape_id("_dlt_id")
```
Collaborator (Author):

Should the row id be marked with a new hint (other than unique) to prevent hard-coding _dlt_id?

Collaborator:

why not unique? technically you just need a unique column to delete / merge child tables correctly. delete-insert does that, right?

Collaborator (Author):

You need the root key to delete from child tables, because you check if the root key is present in the staging root table (lines 606 and 623 in sql_jobs.py).

If I understand correctly, the root key is always _dlt_id, not an arbitrary unique column.

delete-insert indeed uses the unique column hint instead of hard-coding _dlt_id, but that seems wrong to me.

Do we have a test case for the merge disposition with an arbitrary unique key? Maybe the tests pass because in practice the unique hint always resolves to _dlt_id in all our cases.

Collaborator:

no, we do not have such a test case. what I want to achieve here is that we rely on annotations, not on column names.

  • _dlt_id is annotated as unique by a standard schema
  • root_key is also an annotation, not a hardcoded field

relational.py adds config for each table with merge write disposition:
```python
"propagation": {
    "tables": {
        table_name: {
            TColumnName(self.c_dlt_id): TColumnName(self.c_dlt_root_id)
        }
    }
}
```

which will propagate _dlt_id from root to each child table as _dlt_root_id and

```python
self.schema._merge_hints(
    {
        "not_null": [
            TSimpleRegex(self.c_dlt_id),
            TSimpleRegex(self.c_dlt_root_id),
            TSimpleRegex(self.c_dlt_parent_id),
            TSimpleRegex(self.c_dlt_list_idx),
            TSimpleRegex(self.c_dlt_load_id),
        ],
        "foreign_key": [TSimpleRegex(self.c_dlt_parent_id)],
        "root_key": [TSimpleRegex(self.c_dlt_root_id)],
        "unique": [TSimpleRegex(self.c_dlt_id)],
    },
    normalize_identifiers=False,  # already normalized
)
```

makes sure that hints are applied. this is a quite old version of dlt and we'll reimplement it. but the high-level mechanism (we rely on annotations, not names) is good
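The annotation-based lookup advocated above could look roughly like this. The function name and schema-dict shape are assumptions for illustration; the point is that code asks "which column carries the `unique` (or `root_key`) hint?" instead of hard-coding `_dlt_id`.

```python
# Illustrative sketch of hint-based column resolution (names are assumptions):
# scan the table schema for the column carrying the relevant hint instead of
# hard-coding the "_dlt_id" column name.
def get_first_column_name_with_prop(table, prop):
    for name, column in table["columns"].items():
        if column.get(prop):
            return name
    return None

table = {
    "columns": {
        "_dlt_id": {"data_type": "text", "unique": True},
        "_dlt_root_id": {"data_type": "text", "root_key": True},
        "value": {"data_type": "bigint"},
    }
}
assert get_first_column_name_with_prop(table, "unique") == "_dlt_id"
assert get_first_column_name_with_prop(table, "root_key") == "_dlt_root_id"
```

With this in place, renaming the technical columns (as #998 does) would not break the merge SQL, because nothing depends on the literal name.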

@jorritsandbrink jorritsandbrink marked this pull request as ready for review June 15, 2024 19:21
@rudolfix rudolfix left a comment (Collaborator)

this looks good, is short and clean. important info

  • I must merge "allows naming conventions to be changed" (#998) first - there all column names are abstracted away etc. your PR will generate a lot of conflicts and it will be way easier to merge it after that
  • for later: it would be cool if users can request dlt id type for append/replace as well - ie to just use content or primary key (deterministic) hash
  • what happens if input data is not deduplicated? so we have several values with the same primary key? delete-insert is deduplicating input data. see what happens in your case


```python
if row_hash:
    row_id = self.get_row_hash(dict_row)  # type: ignore[arg-type]
    dict_row["_dlt_id"] = row_id
if row_id_type in ("key_hash", "row_hash"):
```
Collaborator:

this logic must be moved to _add_row_id. I think the way we handle _dlt_id needs a little bit more work

  • we have super simple bring your own _dlt_id
```python
row_id = flattened_row.get("_dlt_id", None)
if not row_id:
    row_id = self._add_row_id(table, flattened_row, parent_row_id, pos, _r_lvl)
```

we should replace it with hint based method - if there's any unique column we use it for _dlt_id. that may be a separate ticket. but current "bring your own" must work and now you ignore it here

  • we have many methods to generate _dlt_id for the parent table: random, deterministic (primary key based) and content based (used by scd2). the way you do it now is good enough. in the future I'd like people to be able to pick the way dlt id is generated (ie via ItemsNormalizerConfiguration)
  • for child tables we always have deterministic _dlt_id

Collaborator (Author):

I have moved the logic to _add_row_id. Current "bring your own" (if I understand it correctly) should work now.
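The resulting dispatch can be sketched like this; the function shape and the `row_id_type` values are assumptions based on the thread, not dlt's actual signature. The key property is that a pre-existing `_dlt_id` ("bring your own") always wins, and only otherwise is an id generated per the configured type.

```python
import uuid

# Rough sketch of the _add_row_id dispatch discussed above (names assumed):
# a "bring your own" _dlt_id takes precedence; otherwise the id is derived
# from the configured row_id_type.
def add_row_id(flattened_row, row_id_type, get_row_hash):
    existing = flattened_row.get("_dlt_id")
    if existing:  # bring-your-own _dlt_id: never overwritten
        return existing
    if row_id_type in ("key_hash", "row_hash"):
        row_id = get_row_hash(flattened_row)  # deterministic / content based
    else:
        row_id = uuid.uuid4().hex  # random id
    flattened_row["_dlt_id"] = row_id
    return row_id

fake_hash = lambda row: "hash-" + str(sorted(row.items()))
# a supplied _dlt_id survives even under a deterministic row_id_type
assert add_row_id({"_dlt_id": "mine", "x": 1}, "key_hash", fake_hash) == "mine"
row = {"x": 1}
assert add_row_id(row, "key_hash", fake_hash) == row["_dlt_id"]
```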

Resolved review threads: dlt/common/normalizers/json/relational.py (outdated), dlt/common/normalizers/typing.py

@jorritsandbrink

@rudolfix I addressed your comments. Can you review?

> what happens if input data is not deduplicated? so we have several values with the same primary key? delete-insert is deduplicating input data. see what happens in your case

See "Note 1" in the PR description.

@rudolfix rudolfix added the sprint Marks group of tasks with core team focus at this moment label Jun 26, 2024
```python
assert "primary_key" in r._hints
assert "merge_key" in r._hints
p.run(r())
assert (
```
Collaborator (Author):

This assertion fails on CI because capsys.readouterr().err is an empty string there. It does work on my local machine. Any idea on how to best test warning logs?
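One CI-stable alternative is pytest's `caplog` fixture, which captures log records directly instead of reading stderr, so it does not depend on how the runner wires up `capsys`. The warning text below is a stand-in for whatever the code under test actually emits.

```python
import logging

def emit_fallback_warning():
    # stand-in for the production code that logs the fallback warning
    logging.getLogger("dlt").warning(
        "dlt will fall back to `append` for this table."
    )

def test_fallback_warning(caplog):
    # caplog attaches a capturing handler for the duration of the block
    with caplog.at_level(logging.WARNING, logger="dlt"):
        emit_fallback_warning()
    assert "fall back to `append`" in caplog.text
```

`caplog.text` concatenates all captured records, so the assertion holds regardless of whether the logger also writes to stderr on the CI runner.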

```python
        " merge strategy.",
    )
)
if has_column_with_prop(table, "merge_key"):
```
Collaborator (Author):

I noticed these warnings are logged more than once because verify_sql_job_client_schema is called multiple times in a single pipeline run. Do we have something to prevent that?
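If no such mechanism exists, a minimal warn-once guard could deduplicate messages within a run. This is a sketch under the assumption that nothing like it ships in dlt; the helper name and the module-level set are inventions for illustration.

```python
import logging

# messages already emitted in this process (hypothetical guard)
_seen_warnings = set()

def warn_once(message):
    """Log the warning only the first time this exact message appears."""
    if message in _seen_warnings:
        return False  # suppressed repeat
    _seen_warnings.add(message)
    logging.getLogger("dlt").warning(message)
    return True

assert warn_once("merge_key is ignored by the upsert strategy") is True
assert warn_once("merge_key is ignored by the upsert strategy") is False
```

A per-pipeline-run scope (e.g. clearing the set when a run starts) would likely fit better than process lifetime, but the idea is the same.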

```python
dict_["x-merge-strategy"] = DEFAULT_MERGE_STRATEGY
if "strategy" in mddict:
    if mddict["strategy"] not in MERGE_STRATEGIES:
        raise ValueError(
```
Collaborator (Author):

Is this the right way/place to do user input validation?
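Spelled out, the validation quoted above amounts to the following. The constant values and the dict shape are hedged stand-ins mirroring the snippet, not dlt's exact definitions; the design choice it illustrates is failing at config-parse time with the full list of allowed values in the error.

```python
# Stand-ins mirroring the quoted snippet (actual values may differ in dlt)
MERGE_STRATEGIES = ("delete-insert", "upsert", "scd2")
DEFAULT_MERGE_STRATEGY = "delete-insert"

def resolve_merge_strategy(mddict):
    """Validate and resolve the merge strategy from a write-disposition dict."""
    strategy = mddict.get("strategy", DEFAULT_MERGE_STRATEGY)
    if strategy not in MERGE_STRATEGIES:
        # fail early, when user config is parsed, listing the allowed values
        raise ValueError(
            f"{strategy!r} is not a valid merge strategy. "
            f"Allowed values: {', '.join(MERGE_STRATEGIES)}"
        )
    return strategy

assert resolve_merge_strategy({}) == "delete-insert"
assert resolve_merge_strategy({"strategy": "upsert"}) == "upsert"
```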

@rudolfix rudolfix removed the sprint Marks group of tasks with core team focus at this moment label Jul 3, 2024
Labels: None yet
Projects: Status: In Progress
Successfully merging this pull request may close these issues: Add upsert merge strategy
2 participants