New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

[FLINK-33132] Flink Connector Redshift TableSink Implementation #114

Open

Samrat002 wants to merge 1 commit into apache:main from Samrat002:redshift-connector

Contributor

Samrat002 commented Nov 9, 2023 •

edited

Loading

Purpose of the change

Flink Connector Redshift Sink Implementation

Verifying this change

JDBC mode testing

CREATE TABLE users1 (
>     `id` BIGINT,
>     `data` STRING,
>     PRIMARY KEY (`id`) NOT ENFORCED
> ) WITH (
>     'connector' = 'redshift',
>     'sink.mode' = 'JDBC',
>     'sink.copy-mode.aws.s3-uri' = 's3://dbsamrat-aws-bucket/redshift/flink_sink/users/',
>     'sink.database-name' = 'flink_sink',
>     'hostname' = 'flink-redshift.xxxxxxxxxx.xx-xxxx-x.redshift.amazonaws.com',
>     'sink.aws.iam-role-arn' = 'arn:aws:iam::xxxxxxxxxxxxx:role/service-role/AmazonRedshift-CommandsAccessRole-xxxxxxxxxxxxxxx',
>     'username' = 'admin',
>     'password' = 'xxxxxx',
>     'port' = '5439',
>     'sink.batch-size' = '10',
>     'sink.flush-interval' = '10',
>     'sink.max-retries' = '2',
>     'sink.table-name' = 'users1'
> );

 CREATE TABLE datagentable (
>     id   INT,
>     data STRING
>   ) WITH ('connector' = 'datagen',  'number-of-rows' = '10000');

insert into user1 select * from datagentable;

Significant changes

(Please check any boxes [x] if the answer is "yes". You can first publish the PR and check them afterwards, for convenience.)

Dependencies have been added or upgraded
Public API has been changed (Public API is any class annotated with @Public(Evolving))
Serializers have been changed
New feature has been introduced
- If yes, how is this documented? (not applicable / docs / JavaDocs / not documented)

Samrat002 marked this pull request as draft

November 9, 2023 00:07

Samrat002 changed the title ~~[WIP][FLINK-33132] Flink Connector Redshift Sink Implementation~~ [FLINK-33132][WIP] Flink Connector Redshift Sink Implementation

Samrat002 force-pushed the redshift-connector branch from eaaa378 to 0c013d4 Compare

November 10, 2023 20:39

Samrat002 changed the title ~~[FLINK-33132][WIP] Flink Connector Redshift Sink Implementation~~ [FLINK-33132] Flink Connector Redshift TableSink Implementation

Samrat002 marked this pull request as ready for review

November 10, 2023 20:42

Samrat002 changed the title ~~[FLINK-33132] Flink Connector Redshift TableSink Implementation~~ [WIP][FLINK-33132] Flink Connector Redshift TableSink Implementation

Samrat002 marked this pull request as draft

November 15, 2023 12:46

Samrat002 force-pushed the redshift-connector branch from 118e644 to 92a7615 Compare

December 18, 2023 19:00

Samrat002 changed the title ~~[WIP][FLINK-33132] Flink Connector Redshift TableSink Implementation~~ [FLINK-33132] Flink Connector Redshift TableSink Implementation

Samrat002 marked this pull request as ready for review

December 18, 2023 19:01

Contributor Author

Samrat002 commented Dec 21, 2023

@hlteoh37, @vahmed-hamdy please review in free time 🙏🏻


          [FLINK-33132] Flink Connector Redshift

98901f4

Samrat002 force-pushed the redshift-connector branch from 92a7615 to 98901f4 Compare

January 13, 2024 20:34

melin commented Jan 20, 2024

1、The tuncate table paramter is supported in the batch import scenario. If data exists in a table, duplicate data will be generated and the table must be cleared first
2、Can upsert write data in Batch data import scenarios。
https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html

Contributor Author

Samrat002 commented Jan 30, 2024 •

edited

Loading

thank you for reviewing the pr .

1、The tuncate table paramter is supported in the batch import scenario. If data exists in a table, duplicate data will be generated and the table must be cleared first

If the record exisits in table and redshift table created contains primary key or composite key . it carries out merge into operation . If you check the code we are doing merge into operation if ddl contains primary key .

2、Can upsert write data in Batch data import scenarios。 https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html

can you please elaborate more , as per what i understand you are concern how staged data get merged , in code we are using https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html#merge-method-specify-column-list .

vahmed-hamdy suggested changes

View reviewed changes

Contributor

vahmed-hamdy left a comment

I have left some comments, I will continue the review later.
I believe this PR is incomplete right? we still need to add tests.

...ift/src/main/java/org/apache/flink/connector/redshift/table/RedshiftDynamicTableFactory.java

+              /** Dynamic Table Factory. */
+              @PublicEvolving
+              public class RedshiftDynamicTableFactory implements DynamicTableSinkFactory {
+                  public static final String IDENTIFIER = "redshift";

Contributor

vahmed-hamdy Mar 11, 2024

nit: I would move the configs to a separate file as in flink-connector-aws-kinesis-streams

...ift/src/main/java/org/apache/flink/connector/redshift/table/RedshiftDynamicTableFactory.java

+                  public static final ConfigOption<String> DATABASE_NAME =
+                          ConfigOptions.key("sink.database-name")
+                                  .stringType()
+                                  .defaultValue("dev")

Contributor

vahmed-hamdy Mar 11, 2024

Why do we need to set that?

Contributor Author

Samrat002 May 5, 2024

dev is the default database name created by redshift. i assumed if user dont provide database name as a config then it should assume database name as default one

...ift/src/main/java/org/apache/flink/connector/redshift/table/RedshiftDynamicTableFactory.java

+                                  .noDefaultValue()
+                                  .withDescription("AWS Redshift cluster sink table name.");
+                  public static final ConfigOption<Integer> SINK_BATCH_SIZE =

Contributor

vahmed-hamdy Mar 11, 2024

Suggestion: have you considered using AsyncDynamicTableSink it seems you are reusing some properties here

...ift/src/main/java/org/apache/flink/connector/redshift/table/RedshiftDynamicTableFactory.java

+                                  .stringType()
+                                  .noDefaultValue()
+                                  .withDescription("using Redshift COPY command must provide a S3 URI.");
+                  public static final ConfigOption<String> IAM_ROLE_ARN =

Contributor

vahmed-hamdy Mar 11, 2024

Should we use the existing aws authentication way?

...ift/src/main/java/org/apache/flink/connector/redshift/table/RedshiftDynamicTableFactory.java

+                                          "Currently, 2 modes are supported for Flink connector redshift.\n"
+                                                  + "\t 1) COPY Mode."
+                                                  + "\t 2) JDBC Mode.");
+                  public static final ConfigOption<String> TEMP_S3_URI =

Contributor

vahmed-hamdy Mar 11, 2024

Why TEMP?

Contributor Author

Samrat002 May 5, 2024

in copy mode , redhift data needs to be written into temporary s3 path. these path only useful till copy mode reads the data from temporary location (s3) and uploads to redhsift workers.
In flip this config was mentioned

...t/src/main/java/org/apache/flink/connector/redshift/format/AbstractRedshiftOutputFormat.java


		private static final Logger LOG = LoggerFactory.getLogger(AbstractRedshiftOutputFormat.class);

		protected transient volatile boolean closed = false;

Contributor

vahmed-hamdy Mar 11, 2024

this is smelly, have you tested that away from local clusters and with checkpointing?

...t/src/main/java/org/apache/flink/connector/redshift/format/AbstractRedshiftOutputFormat.java

+                  public synchronized void close() {
+                      if (!closed) {
+                          closed = true;

Contributor

vahmed-hamdy Mar 11, 2024

nit: remove new line

...t/src/main/java/org/apache/flink/connector/redshift/format/AbstractRedshiftOutputFormat.java

+                          try {
+                              flush();
+                          } catch (Exception exception) {
+                              LOG.warn("Flushing records to Redshift failed.", exception);

Contributor

vahmed-hamdy Mar 11, 2024

We are swallowing all exceptions here, this seems like a smell and could possibly break delivery guarantees. We should capture specific exceptions only and bubble/wrap up the rest.

...t/src/main/java/org/apache/flink/connector/redshift/format/AbstractRedshiftOutputFormat.java

+                  public void scheduledFlush(long intervalMillis, String executorName) {
+                      Preconditions.checkArgument(intervalMillis > 0, "flush interval must be greater than 0");
+                      scheduler = new ScheduledThreadPoolExecutor(1, new ExecutorThreadFactory(executorName));
+                      scheduledFuture =

Contributor

vahmed-hamdy Mar 11, 2024

This breaks the execution model, You should use the mailboxExecutor instead.

...ctor-redshift/src/main/java/org/apache/flink/connector/redshift/options/RedshiftOptions.java

+              import java.time.Duration;
+              import java.util.Optional;
+              /** Options. */

Contributor

vahmed-hamdy Mar 11, 2024

nit: Could we use more descriptive Javadoc for example: "Options to configure connection to redshift"

Contributor Author

Samrat002 commented May 5, 2024

I have left some comments, I will continue the review later. I believe this PR is incomplete right? we still need to add tests.

yes , tests were not added , since it will increase the size of PR.
As discussed offline , i will reduce the scope of this pr to only Async v2 and move other things to succeeding pr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet