Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add note warning about NULL merge keys #196

Merged
merged 1 commit into from
Sep 12, 2017
Merged

Conversation

mjalkio
Copy link
Contributor

@mjalkio mjalkio commented Aug 18, 2017

As the description says, I just added a warning. I had a situation where one of my merge keys (I was using three) was allowed to be NULL and it didn't work because Redshift can't compare NULL = NULL.

Unfortunately, I don't think there's a way to support NULL merge keys.

@hito4t
Copy link
Contributor

hito4t commented Aug 23, 2017

Thank you for the PR!

As you wrote, merge mode doesn't work for NULL keys.

Unfortunately, I don't think there's a way to support NULL merge keys.

Following SQL will work for NULL keys.

 (tableA.colA1 = tableB.colB1 or tableA.colA1 is null and tableB.colB1 is null)
  and (tableA.colA2 = tableB.colB2 or tableA.colA2 is null and tableB.colB2 is null)
  ...

Do you think it is useful to support NULL keys?

@mjalkio
Copy link
Contributor Author

mjalkio commented Aug 27, 2017

Hmm the query currently winds up looking something like this:

INSERT INTO tableB (colb1, colb2)
UNION ALL (
  SELECT colA1, colA2
  FROM tableA
  WHERE (colA1, colA2) NOT IN (
    SELECT (colb1, colb2)
    FROM tableB
  )
) 

So I'm not sure if your technique would be able to be applied to this style of query.

@hito4t
Copy link
Contributor

hito4t commented Sep 5, 2017

Thank you for your reply.

The above SQL can be modified to support NULL keys as follows.

INSERT INTO tableB (colB1, colB2)
UNION ALL (
  SELECT colA1, colA2
  FROM tableA
  WHERE NOT EXISTS
  (SELECT * FROM tableB
  WHERE 
    (colB1 = colA1 OR colB1 IS NULL AND colA1 IS NULL) AND
    (colB2 = colA2 OR colB2 IS NULL AND colA2 IS NULL))
) 

But the SQL may be slower than the original SQL.
So even if embulk-output-redshift supports NULL merge keys, it should be optional.
(generates different SQLs depending on config property such as null_merge_keys.)

@mjalkio
Copy link
Contributor Author

mjalkio commented Sep 5, 2017

Wouldn't using WHERE NOT EXISTS return us to the same problem I fixed in #194 ?

@hito4t
Copy link
Contributor

hito4t commented Sep 5, 2017

The old wrong SQL is as follows.

  ...
  SELECT col1, col2, ...
  FROM temp
  WHERE NOT EXISTS
  (SELECT 1 FROM
    temp, target
    WHERE
      temp.col1 = target.col1 AND
      temp.col2 = target.col2 AND
      ...

The slip is selecting from temporary table in subquery.
The following SQL will work.

  ...
  SELECT col1, col2, ...
  FROM temp
  WHERE NOT EXISTS
  (SELECT 1 FROM
    target
    WHERE
      temp.col1 = target.col1 AND
      temp.col2 = target.col2 AND
      ...

@mjalkio
Copy link
Contributor Author

mjalkio commented Sep 5, 2017

I think the problem will still exist where WHERE NOT EXISTS will either evaluate to TRUE or FALSE for all rows in temp.

@hito4t
Copy link
Contributor

hito4t commented Sep 7, 2017

@mjalkio
In the former SQL, all rows in temp will be matched with all rows in target.
In the latter SQL, a temp row in the outside SELECT will be matched with all rows in target.

CREATE TABLE temp (
  col1 int,
  col2 int
);
INSERT INTO temp VALUES(11, 21);
INSERT INTO temp VALUES(12, 22);

CREATE TABLE target (
  col1 int,
  col2 int
);
INSERT INTO target VALUES(11, 21);
INSERT INTO target VALUES(13, 23);
SELECT col1, col2
FROM temp
WHERE NOT EXISTS
(SELECT 1 FROM
  temp, target
  WHERE
    temp.col1 = target.col1 AND
    temp.col2 = target.col2);

col1         col2
--------------------------
SELECT col1, col2
FROM temp
WHERE NOT EXISTS
(SELECT 1 FROM
  target
  WHERE
    temp.col1 = target.col1 AND
    temp.col2 = target.col2);

col1         col2
--------------------------
12           22

@mjalkio
Copy link
Contributor Author

mjalkio commented Sep 7, 2017

Hm, you're right. That's definitely not how I imagined that query would work, but it does!

@hito4t
Copy link
Contributor

hito4t commented Sep 12, 2017

We may support NULL merge keys in future ( #198 ).
Other embulk-output-jdbc plugins may not support NULL merge keys ( #199 ).
Anyway, I'll merge this PR.

@hito4t hito4t merged commit e353d89 into embulk:master Sep 12, 2017
@hito4t hito4t added this to the 0.7.12 milestone Nov 24, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants