
Add upsert merge strategy #1466

Open
wants to merge 28 commits into base: devel
Conversation


@jorritsandbrink jorritsandbrink commented Jun 14, 2024

Description

  • adds a new merge strategy called upsert
  • requires primary_key, which should be unique (see note 1)
    • stores primary-key hash as _dlt_id
    • does not support primary key-based deduplication like the delete-insert strategy
  • does not support merge_key
  • only tested for the postgres and snowflake destinations (other destinations will come in a follow-up PR)

Note 1:

A primary key-based _dlt_id requires that the primary key is unique before going into the normalize step. Primary key-based deduplication in the load step, as we do in the delete-insert merge strategy, is not possible.

Problems when primary key is not unique (i.e. it has duplicates):

  • database error because UNIQUE constraint on _dlt_id column is violated (can be solved by not imposing the constraint)
  • foreign keys in child staging tables link to multiple records in root staging table ➜ can't deduplicate child tables
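The core mechanism the note describes (a deterministic `_dlt_id` derived from the primary key) can be sketched as below. This is an illustrative stand-in, not dlt's actual implementation; the helper name, separator, and id length are assumptions. It also shows why duplicates break the scheme: two rows with the same key collapse onto the same `_dlt_id`.

```python
import base64
import hashlib

def key_hash(row: dict, primary_key: list) -> str:
    """Hypothetical sketch: deterministic row id from primary key values.

    Because the id is a pure function of the key, two rows sharing a key
    collide on _dlt_id -- which is why a UNIQUE constraint on _dlt_id is
    violated when the input contains duplicates.
    """
    # join key values with a separator unlikely to occur in the data
    payload = "\x01".join(str(row[k]) for k in primary_key)
    digest = hashlib.sha256(payload.encode("utf-8")).digest()
    # shorten to a compact text id (length chosen arbitrarily here)
    return base64.urlsafe_b64encode(digest[:10]).decode("ascii").rstrip("=")

row_a = {"id": 1, "name": "a"}
row_b = {"id": 1, "name": "b"}  # duplicate primary key, different payload
# same key => same _dlt_id, regardless of the rest of the row
assert key_hash(row_a, ["id"]) == key_hash(row_b, ["id"])
```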

Related Issues

@jorritsandbrink jorritsandbrink self-assigned this Jun 14, 2024
@jorritsandbrink jorritsandbrink linked an issue Jun 14, 2024 that may be closed by this pull request

netlify bot commented Jun 14, 2024

Deploy Preview for dlt-hub-docs ready!

Latest commit: e5e68f2
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/668d6a7cf4040f0008b4a4b5
Deploy Preview: https://deploy-preview-1466--dlt-hub-docs.netlify.app

```python
    " dlt will fall back to `append` for this table."
)
elif table.get("x-merge-strategy") == "upsert":
    if self.config.destination_name not in ("postgres", "snowflake"):
```
Collaborator (Author):
Should we add a supported_merge_strategies destination capability?

Collaborator:

yes - if you at some point (as early as possible ie. when passing data to normalizer when we must have destination capabilities) are able to issue a warning and say which strategy will be used instead.

same thing with replace strategies (we have 3 afaik)

also if you want to do it: we do not even store which write dispositions are supported. so maybe this is a separate ticket :)

Collaborator (Author):

  • added supported_merge_strategies capability and configured it for all SQL destinations
  • added _verify_destination_capabilities that checks configured merge strategy against supported merge strategies
    • raises error if not supported
      • (I know you suggested falling back to a supported strategy and issuing a warning instead, but I'm not a fan of that approach. Will of course change it if that's how we do things in dlt, but wanted to challenge it first.)
    • called right before normalize step — is that the right place?

I can create a separate ticket for supported replace strategies and supported write dispositions capabilities.
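The capability check described above can be sketched roughly as follows. The class and function names (`DestinationCapabilities`, `supported_merge_strategies`, `verify_merge_strategy`) are assumptions modeled on the discussion, not dlt's actual API; the sketch raises rather than falling back, per the comment above.

```python
# Hypothetical sketch of the supported_merge_strategies capability check.
class UnsupportedMergeStrategy(Exception):
    pass

class DestinationCapabilities:
    def __init__(self, supported_merge_strategies):
        self.supported_merge_strategies = supported_merge_strategies

def verify_merge_strategy(caps, strategy, table_name):
    # raise early (before normalize) instead of silently falling back
    if strategy not in caps.supported_merge_strategies:
        raise UnsupportedMergeStrategy(
            f"Table {table_name!r} is configured with merge strategy"
            f" {strategy!r}, but this destination only supports"
            f" {caps.supported_merge_strategies}"
        )

caps = DestinationCapabilities(["delete-insert", "upsert", "scd2"])
verify_merge_strategy(caps, "upsert", "my_table")  # passes silently
```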

```python
# generate statements for child tables if they exist
child_tables = table_chain[1:]
if child_tables:
    root_row_key = escape_id("_dlt_id")
```
Collaborator (Author):

Should the row id be marked with a new hint (other than unique) to prevent hard-coding _dlt_id?

Collaborator:

why not unique? technically you just need a unique column to delete / merge child tables correctly. delete-insert does that, right?

Collaborator (Author):

You need the root key to delete from child tables, because you check if the root key is present in the staging root table (lines 606 and 623 in sql_jobs.py).

If I understand correctly, the root key is always _dlt_id, not an arbitrary unique column.

delete-insert indeed uses the unique column hint instead of hard-coding _dlt_id, but that seems wrong to me.

Do we have a test case for the merge disposition with an arbitrary unique key? Maybe the tests pass because in practice the unique hint always resolves to _dlt_id in all our cases.

Collaborator:

no, we do not have such a test case. what I want to achieve here is that we rely on annotations, not on column names.

  • _dlt_id is annotated as unique by a standard schema
  • root_key is also an annotation, not a hardcoded field

relational.py adds config for each table with merge write disposition:
```python
"propagation": {
    "tables": {
        table_name: {
            TColumnName(self.c_dlt_id): TColumnName(self.c_dlt_root_id)
        }
    }
}
```

which will propagate _dlt_id from root to each child table as _dlt_root_id and

```python
self.schema._merge_hints(
    {
        "not_null": [
            TSimpleRegex(self.c_dlt_id),
            TSimpleRegex(self.c_dlt_root_id),
            TSimpleRegex(self.c_dlt_parent_id),
            TSimpleRegex(self.c_dlt_list_idx),
            TSimpleRegex(self.c_dlt_load_id),
        ],
        "foreign_key": [TSimpleRegex(self.c_dlt_parent_id)],
        "root_key": [TSimpleRegex(self.c_dlt_root_id)],
        "unique": [TSimpleRegex(self.c_dlt_id)],
    },
    normalize_identifiers=False,  # already normalized
)
```

makes sure that hints are applied. this is a quite old version of dlt and we'll reimplement it. but the high-level mechanism (we rely on annotations, not names) is good
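The annotation-based lookup advocated above could look roughly like this. The function name and schema-dict shape are assumptions for illustration; the point is that code asks "which column carries the `unique` (or `root_key`) hint?" instead of hard-coding `_dlt_id`.

```python
# Illustrative sketch of hint-based column resolution (names are assumptions):
# scan the table schema for the column carrying the relevant hint instead of
# hard-coding the "_dlt_id" column name.
def get_first_column_name_with_prop(table, prop):
    for name, column in table["columns"].items():
        if column.get(prop):
            return name
    return None

table = {
    "columns": {
        "_dlt_id": {"data_type": "text", "unique": True},
        "_dlt_root_id": {"data_type": "text", "root_key": True},
        "value": {"data_type": "bigint"},
    }
}
assert get_first_column_name_with_prop(table, "unique") == "_dlt_id"
assert get_first_column_name_with_prop(table, "root_key") == "_dlt_root_id"
```

With this in place, renaming the technical columns (as #998 does) would not break the merge SQL, because nothing depends on the literal name.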

@jorritsandbrink jorritsandbrink marked this pull request as ready for review June 15, 2024 19:21
@rudolfix rudolfix left a comment (Collaborator)

this looks good, is short and clean. important info

  • I must merge "allows naming conventions to be changed" (#998) first - there all column names are abstracted away etc. your PR will generate a lot of conflicts and it will be way easier to merge it after that
  • for later: it would be cool if users can request dlt id type for append/replace as well - ie to just use content or primary key (deterministic) hash
  • what happens if input data is not deduplicated? so we have several values with the same primary key? delete-insert is deduplicating input data. see what happens in your case


```python
if row_hash:
    row_id = self.get_row_hash(dict_row)  # type: ignore[arg-type]
    dict_row["_dlt_id"] = row_id
if row_id_type in ("key_hash", "row_hash"):
```
Collaborator:

this logic must be moved to _add_row_id. I think the way we handle _dlt_id needs a little bit more work

  • we have super simple bring your own _dlt_id
```python
row_id = flattened_row.get("_dlt_id", None)
if not row_id:
    row_id = self._add_row_id(table, flattened_row, parent_row_id, pos, _r_lvl)
```

we should replace it with hint based method - if there's any unique column we use it for _dlt_id. that may be a separate ticket. but current "bring your own" must work and now you ignore it here

  • we have many methods to generate _dlt_id for the parent table: random, deterministic (primary key based) and content based (used by scd2). the way you do it now is good enough. in the future I'd like people to be able to pick the way dlt id is generated (ie via ItemsNormalizerConfiguration)
  • for child tables we always have deterministic _dlt_id

Collaborator (Author):

I have moved the logic to _add_row_id. Current "bring your own" (if I understand it correctly) should work now.
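The resulting dispatch can be sketched like this; the function shape and the `row_id_type` values are assumptions based on the thread, not dlt's actual signature. The key property is that a pre-existing `_dlt_id` ("bring your own") always wins, and only otherwise is an id generated per the configured type.

```python
import uuid

# Rough sketch of the _add_row_id dispatch discussed above (names assumed):
# a "bring your own" _dlt_id takes precedence; otherwise the id is derived
# from the configured row_id_type.
def add_row_id(flattened_row, row_id_type, get_row_hash):
    existing = flattened_row.get("_dlt_id")
    if existing:  # bring-your-own _dlt_id: never overwritten
        return existing
    if row_id_type in ("key_hash", "row_hash"):
        row_id = get_row_hash(flattened_row)  # deterministic / content based
    else:
        row_id = uuid.uuid4().hex  # random id
    flattened_row["_dlt_id"] = row_id
    return row_id

fake_hash = lambda row: "hash-" + str(sorted(row.items()))
# a supplied _dlt_id survives even under a deterministic row_id_type
assert add_row_id({"_dlt_id": "mine", "x": 1}, "key_hash", fake_hash) == "mine"
row = {"x": 1}
assert add_row_id(row, "key_hash", fake_hash) == row["_dlt_id"]
```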

Resolved review threads: dlt/common/normalizers/json/relational.py (outdated), dlt/common/normalizers/typing.py

@jorritsandbrink

@rudolfix I addressed your comments. Can you review?

> what happens if input data is not deduplicated? so we have several values with the same primary key? delete-insert is deduplicating input data. see what happens in your case

See "Note 1" in the PR description.

@rudolfix rudolfix added the sprint Marks group of tasks with core team focus at this moment label Jun 26, 2024
```python
assert "primary_key" in r._hints
assert "merge_key" in r._hints
p.run(r())
assert (
```
Collaborator (Author):

This assertion fails on CI because capsys.readouterr().err is an empty string there. It does work on my local machine. Any idea on how to best test warning logs?
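One CI-stable alternative is pytest's `caplog` fixture, which captures log records directly instead of reading stderr, so it does not depend on how the runner wires up `capsys`. The warning text below is a stand-in for whatever the code under test actually emits.

```python
import logging

def emit_fallback_warning():
    # stand-in for the production code that logs the fallback warning
    logging.getLogger("dlt").warning(
        "dlt will fall back to `append` for this table."
    )

def test_fallback_warning(caplog):
    # caplog attaches a capturing handler for the duration of the block
    with caplog.at_level(logging.WARNING, logger="dlt"):
        emit_fallback_warning()
    assert "fall back to `append`" in caplog.text
```

`caplog.text` concatenates all captured records, so the assertion holds regardless of whether the logger also writes to stderr on the CI runner.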

```python
        " merge strategy.",
    )
)
if has_column_with_prop(table, "merge_key"):
```
Collaborator (Author):

I noticed these warnings are logged more than once because verify_sql_job_client_schema is called multiple times in a single pipeline run. Do we have something to prevent that?
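If no such mechanism exists, a minimal warn-once guard could deduplicate messages within a run. This is a sketch under the assumption that nothing like it ships in dlt; the helper name and the module-level set are inventions for illustration.

```python
import logging

# messages already emitted in this process (hypothetical guard)
_seen_warnings = set()

def warn_once(message):
    """Log the warning only the first time this exact message appears."""
    if message in _seen_warnings:
        return False  # suppressed repeat
    _seen_warnings.add(message)
    logging.getLogger("dlt").warning(message)
    return True

assert warn_once("merge_key is ignored by the upsert strategy") is True
assert warn_once("merge_key is ignored by the upsert strategy") is False
```

A per-pipeline-run scope (e.g. clearing the set when a run starts) would likely fit better than process lifetime, but the idea is the same.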

```python
dict_["x-merge-strategy"] = DEFAULT_MERGE_STRATEGY
if "strategy" in mddict:
    if mddict["strategy"] not in MERGE_STRATEGIES:
        raise ValueError(
```
Collaborator (Author):

Is this the right way/place to do user input validation?
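Spelled out, the validation quoted above amounts to the following. The constant values and the dict shape are hedged stand-ins mirroring the snippet, not dlt's exact definitions; the design choice it illustrates is failing at config-parse time with the full list of allowed values in the error.

```python
# Stand-ins mirroring the quoted snippet (actual values may differ in dlt)
MERGE_STRATEGIES = ("delete-insert", "upsert", "scd2")
DEFAULT_MERGE_STRATEGY = "delete-insert"

def resolve_merge_strategy(mddict):
    """Validate and resolve the merge strategy from a write-disposition dict."""
    strategy = mddict.get("strategy", DEFAULT_MERGE_STRATEGY)
    if strategy not in MERGE_STRATEGIES:
        # fail early, when user config is parsed, listing the allowed values
        raise ValueError(
            f"{strategy!r} is not a valid merge strategy. "
            f"Allowed values: {', '.join(MERGE_STRATEGIES)}"
        )
    return strategy

assert resolve_merge_strategy({}) == "delete-insert"
assert resolve_merge_strategy({"strategy": "upsert"}) == "upsert"
```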

@rudolfix rudolfix removed the sprint Marks group of tasks with core team focus at this moment label Jul 3, 2024
Labels: None yet
Projects: Status: In Progress
Successfully merging this pull request may close these issues: Add upsert merge strategy
2 participants