
[FLINK-35500][Connectors/DynamoDB] DynamoDb Table API Sink fails to delete elements due to key not found #152

Open · wants to merge 6 commits into base: main

Conversation

robg-eb

robg-eb commented Jul 24, 2024

Purpose of the change

When DynamoDbSink is used with CDC sources, it fails to process DELETE records and throws

    org.apache.flink.connector.dynamodb.shaded.software.amazon.awssdk.services.dynamodb.model.DynamoDbException: The provided key element does not match the schema

This is due to DynamoDbSinkWriter passing the whole DynamoDB item as the key instead of only the constructed primary key.
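The failing path can be illustrated with a stdlib-only sketch (names and types are simplified stand-ins, not the connector's actual code): for a DELETE, only the primary-key attributes may be sent to DynamoDB, so the full item has to be filtered down to the declared key columns.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration, not the connector's code: attribute values are
// plain Strings here instead of the AWS SDK's AttributeValue.
public class KeyExtraction {

    // Keep only the declared primary-key attributes of a full item.
    static Map<String, String> extractKey(Map<String, String> item, List<String> primaryKeyColumns) {
        Map<String, String> key = new LinkedHashMap<>();
        for (String column : primaryKeyColumns) {
            String value = item.get(column);
            if (value == null) {
                throw new IllegalArgumentException("Missing primary key attribute: " + column);
            }
            key.put(column, value);
        }
        return key;
    }

    public static void main(String[] args) {
        Map<String, String> item = new LinkedHashMap<>();
        item.put("user_id", "u1");
        item.put("order_id", "o1");
        item.put("price", "9.99");

        // Sending `item` itself as the delete key is what triggered
        // "The provided key element does not match the schema".
        System.out.println(extractKey(item, List.of("user_id", "order_id")));
        // prints {user_id=u1, order_id=o1}
    }
}
```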

Verifying this change

This change added tests and can be verified as follows:

  • Added tests to RowDataToAttributeValueConverterTest.java:

    • testDeleteOnlyPrimaryKey - Ensures that for a DELETE request, only the (single) PK field is included
    • testDeleteOnlyPrimaryKeys - Ensures that for a DELETE request with a composite PK, both PK fields are included.
    • testPKIgnoredForInsert - Ensures that PK is ignored when an INSERT request is done, and all fields continue to be included as they have been in the past.
    • testPKIgnoredForUpdateAfter - Ensures that PK is ignored when an UPDATE_AFTER request is done, and all fields continue to be included as they have been in the past.
  • Ran manual tests following the steps noted in https://issues.apache.org/jira/browse/FLINK-35500 under "Steps To Reproduce". Running the SQL statement as described in Step 6 now properly runs a DELETE in DynamoDB.

Significant changes

Previously, the PRIMARY KEY field had no significance for a DynamoDB sink via the Table API. Now, a PRIMARY KEY is required when processing a CDC stream that contains DELETEs. This is not a breaking change because the previous behavior for processing a CDC stream containing DELETEs was already a failure (The provided key element does not match the schema). This change now provides a clear exception informing users to specify a primary key to avoid that failure. To clarify this change, the PR contains updates to the connector documentation.
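The clearer failure mode described above can be sketched as follows (a hypothetical stand-alone check, not the PR's actual code):

```java
// Hypothetical sketch of the behavior described above, not the PR's code:
// fail fast with an actionable message when a CDC DELETE arrives and no
// PRIMARY KEY was declared on the table.
public class DeleteSupportCheck {
    static void requirePrimaryKeyForDelete(boolean primaryKeyDeclared) {
        if (!primaryKeyDeclared) {
            throw new IllegalStateException(
                    "Processing DELETE records requires a PRIMARY KEY to be specified on the table");
        }
    }

    public static void main(String[] args) {
        requirePrimaryKeyForDelete(true); // fine: primary key declared
    }
}
```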


boring-cyborg bot commented Jul 24, 2024

Thanks for opening this pull request! Please check out our contributing guidelines. (https://flink.apache.org/contributing/how-to-contribute.html)

```java
Map<String, AttributeValue> expectedResult =
        singletonMap(key, AttributeValue.builder().s(value).build());

assertThat(actualResult).containsAllEntriesOf(expectedResult);
```
Contributor

nit: could be done on one assert using containsExactly

Author

See response below

```java
expectedResult.put(key, AttributeValue.builder().s(value).build());
expectedResult.put(additionalKey, AttributeValue.builder().s(additionalValue).build());

assertThat(actualResult).containsAllEntriesOf(expectedResult);
```
Contributor

nit: could be done on one assert using containsExactly

Author

See response below

```java
expectedResult.put(key, AttributeValue.builder().s(value).build());
expectedResult.put(otherField, AttributeValue.builder().s(otherValue).build());

assertThat(actualResult).containsAllEntriesOf(expectedResult);
```
Contributor

nit: could be done on one assert using containsExactly

Author

@vahmed-hamdy - I took a look at doing this, but containsExactly only works on (ordered) lists, whereas the things we're comparing here are actually HashMaps. We'd have to convert these HashMaps to lists and impose an order on them to compare them, which seems more complicated than just running two separate assertions - would you agree? Let me know if you think there's a better way - I'm admittedly not a Java expert.

Contributor

You can use assertThat(...).containsExactlyInAnyOrderEntriesOf(...) here to check that maps contain same elements without worrying about order.
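As a side note, plain java.util.Map equality is already insertion-order-insensitive, which is the same guarantee AssertJ's containsExactlyInAnyOrderEntriesOf provides (with richer failure messages); a minimal stdlib-only demonstration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stdlib-only demonstration: Map.equals compares entry sets, so two maps
// with the same entries are equal regardless of insertion order -- no
// conversion to an ordered list is needed.
public class MapEqualityDemo {
    public static void main(String[] args) {
        Map<String, String> expected = new LinkedHashMap<>();
        expected.put("partition_key", "pk");
        expected.put("sort_key", "sk");

        Map<String, String> actual = new LinkedHashMap<>();
        actual.put("sort_key", "sk");       // reversed insertion order
        actual.put("partition_key", "pk");

        System.out.println(expected.equals(actual)); // prints true
    }
}
```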

@vahmed-hamdy
Contributor

There are some style violations. Could you run mvn spotless:apply ahead of the PR, please?

@robg-eb
Author

robg-eb commented Aug 1, 2024

@vahmed-hamdy -

> There are some style violations. Could you run mvn spotless:apply ahead of the PR, please?

Done, I've applied updated formatting now!

Contributor

z3d1k left a comment

Thank you for the contribution!
Posted a few notes on code, but overall implementation looks good

@hlteoh37 please take a look


@vahmed-hamdy
Contributor

Thanks @robg-eb,
@hlteoh37 could you please take a look?

@robg-eb
Author

robg-eb commented Aug 13, 2024

I addressed all the open comments from @z3d1k now. Thank you all for the suggestions! Sounds like @hlteoh37 should be the next to take a look?

```diff
@@ -62,7 +63,8 @@ protected DynamoDbDynamicSink(
         boolean failOnError,
         Properties dynamoDbClientProperties,
         DataType physicalDataType,
-        Set<String> overwriteByPartitionKeys) {
+        Set<String> overwriteByPartitionKeys,
+        Set<String> primaryKeys) {
```

In DynamoDB nomenclature, this set of properties is a primaryKey (not primaryKeys). Let's use a proper name


+1

Author

To be clear, you're just suggesting changing the variable name from primaryKeys to primaryKey, right? And leaving its data type of Set<String> intact, since a primary key can consist of two fields - the partition key and the sort key?

```java
if (catalogTable.getResolvedSchema().getPrimaryKey().isPresent()) {
    builder =
            builder.setPrimaryKeys(
                    new HashSet<>(
```
dzikosc commented Sep 12, 2024

DynamoDB primary key is an ordered set of properties - partitionKey and then sortKey. Let's model it like that.

I would suggest using those as dedicated properties instead in the Sink model. We could add an extra validation here to ensure consistency with DDB schema.

  • if one column is provided, it's the partition key
  • if two columns are provided, the first is the partition key and the second is the sort key
  • if three or more columns are specified, table registration should fail
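The proposed validation could look roughly like this (class and method names are illustrative, not the connector's API):

```java
import java.util.List;

// Illustrative sketch of the validation proposed above, not the
// connector's API: 1 column -> partition key only, 2 columns ->
// partition key then sort key, anything else fails registration.
public class DdbPrimaryKey {
    final String partitionKey;
    final String sortKey; // null when the table has no sort key

    private DdbPrimaryKey(String partitionKey, String sortKey) {
        this.partitionKey = partitionKey;
        this.sortKey = sortKey;
    }

    static DdbPrimaryKey fromColumns(List<String> columns) {
        if (columns.size() == 1) {
            return new DdbPrimaryKey(columns.get(0), null);
        }
        if (columns.size() == 2) {
            return new DdbPrimaryKey(columns.get(0), columns.get(1));
        }
        throw new IllegalArgumentException(
                "A DynamoDB primary key has 1 or 2 columns, but got " + columns.size());
    }
}
```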


+1
It must be an ordered set/list of 1 or 2 elements.

Suggestion: if not too complicated, from a user's perspective it would be clearer to pass a PrimaryKey object with two fields, partitionKey and sortKey.

Author

@nicusX - While having a dedicated PrimaryKey object with two fields partitionKey and sortKey might work for the DataStream API if I were to add it to the Sink model, I am not clear on how that would then translate to the Table API. I was hoping to just use the fact that the Table API / SQL API already supports the concept of passing in a PRIMARY KEY for that.

I don't think we need to separate the partition key and sort key for the purpose of identifying the primary key here - and in fact there would also be a naming collision, since the current Table API / SQL connector already supports a PARTITIONED BY clause, adding to potential confusion.


Ack for not having an object with separate fields.
However, in this case the primaryKey (singular) should be an ordered set (e.g. a List) and not a Set. The order of the two fields is relevant: the first always being the partitionKey, in DDB parlance, and the second, if present, the sortKey

dzikosc left a comment

Let's model the concept of DDB primary key, using the specification and correct naming

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.CoreComponents.html#HowItWorks.CoreComponents.PrimaryKey

```
@@ -58,6 +58,17 @@ public DynamicTableSink createDynamicTableSink(Context context) {
        .setDynamoDbClientProperties(
                dynamoDbConfiguration.getSinkClientProperties());

if (catalogTable.getResolvedSchema().getPrimaryKey().isPresent()) {
```
dzikosc commented Sep 12, 2024

In the case of DynamoDB, the contents of PRIMARY KEY and PARTITIONED BY should be the same to ensure the semantic correctness of various operations.

To avoid repetitiveness and simplify usage, we should aim to use just the PRIMARY KEY property, as it's a fairly intuitive and well-understood concept for DDB users. In the meantime, we should probably start the deprecation process for the PARTITIONED BY mechanism.

The first steps could be:

  • if PRIMARY KEY is set but PARTITIONED BY is not, use the primary key as the partition-by columns
  • document that PARTITIONED BY is on a deprecation path
  • log warnings when PARTITIONED BY is used
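The first of those steps could be sketched as follows (names are hypothetical, not the connector's API):

```java
import java.util.List;

// Hypothetical sketch of the fallback proposed above: prefer PRIMARY KEY,
// fall back to it when PARTITIONED BY is absent, and warn while
// PARTITIONED BY is on its deprecation path.
public class DedupKeyResolver {
    static List<String> resolve(List<String> primaryKey, List<String> partitionedBy) {
        if (partitionedBy.isEmpty()) {
            // Primary key doubles as the dedup/partition columns.
            return primaryKey;
        }
        // Warn while PARTITIONED BY is still accepted.
        System.err.println(
                "WARN: PARTITIONED BY is deprecated for the DynamoDB sink; prefer PRIMARY KEY");
        return partitionedBy;
    }
}
```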


+1
Also, I'd add a check that, if PARTITION BY and PRIMARY KEY are both specified, they must be identical

Author

@dzikosc - Isn't there a potential use case wherein a user of the existing PARTITION BY functionality of the sink wants to use it to deduplicate data by something other than the actual Primary Key of the DynamoDB table? For example, the DynamoDB table has:

  • user_id - partition key
  • order_id - sort key
  • line_number
  • price

Today, the user of the connector could specify PARTITIONED BY user_id, order_id, line_number to deduplicate incoming data. Maybe this is a contrived case, but I'm just pointing out that we may now be blocking a use case that was previously supported. Thoughts?

Either way, could we keep the deprecation of PARTITION BY to a separate PR to reduce the scope of this one?

`item_id` BIGINT,
`category_id` BIGINT,
`behavior` STRING,
PRIMARY KEY (user_id) NOT ENFORCED

Why do we use the NOT ENFORCED qualifier? In the case of DynamoDB, the primary key property (or properties) is always present on the record.

);

Note that this primary key functionality, specified by `PRIMARY KEY`, can be used alongside the sink partitioning mentioned above via `PARTITIONED BY` to deduplicate data and support DELETEs.

I would clarify that PARTITIONED BY is used by Flink for deduplication within the same batch (and to support deletes), but it's different from DynamoDB's partition key. If the user is familiar with DynamoDB but not much with Flink, this may generate a lot of confusion.


Also, PARTITIONED BY and PRIMARY KEY should always be the same.
(see the other comment by @dzikosc above)


@hlteoh37
Contributor

hlteoh37 commented Nov 4, 2024

@robg-eb Thank you for working on this. Can I check if you plan to address the remaining comments?
