Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Prevent Oracle LOB columns from causing exceptions in Column and Row validation #1339

Closed
wants to merge 24 commits into from

Conversation

nj1973
Copy link
Contributor

@nj1973 nj1973 commented Nov 21, 2024

Oracle LOB columns are pretty restrictive to use, for example:

SQL> select count(col_clob) from pso_data_validator.dvt_ora2pg_types;
select count(col_clob) from pso_data_validator.dvt_ora2pg_types
             *
ERROR at line 1:
ORA-00932: inconsistent datatypes: expected - got CLOB

In DVT we lose the true data type and work with Ibis canonical types, for example a CLOB column becomes dt.String, a BLOB column becomes dt.Binary. This PR adds code to fetch the raw data types for Oracle tables/custom queries and excludes BLOB, CLOB and NCLOB from Column and Row (hash/concat) validations.

I only added the raw_metadata() for Oracle because that is the only engines we currently need it for. If we run into similar issues for other engines where we need to differentiate between specific raw data types then we can extend this technique further.

Ideally we would be able to add expressions to truly validate these types. I see that as a feature request rather than an exception avoidance issue which is what this one is.

@nj1973 nj1973 requested a review from a team as a code owner November 21, 2024 17:14
@nj1973 nj1973 marked this pull request as draft November 21, 2024 17:14
@nj1973 nj1973 marked this pull request as ready for review November 22, 2024 06:31
data_validation/config_manager.py Show resolved Hide resolved
data_validation/config_manager.py Show resolved Hide resolved
data_validation/config_manager.py Show resolved Hide resolved
data_validation/config_manager.py Outdated Show resolved Hide resolved
third_party/ibis/ibis_oracle/__init__.py Outdated Show resolved Hide resolved
# Oracle LOB columns are invalid for use with SQL COUNT function.
elif agg_type == "count" and self._is_oracle_lob(column):
# TODO Eventually we need to allow COUNT(Oracle LOB) by using a CASE expression:
# COUNT(CASE WHEN clob_col IS NULL THEN NULL ELSE 1 END)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why skip this column when you can change the query builder here , using the ibis.cases method. Isn't that a simpler and better solution ?

Thanks.

Sundar Mudupalli

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I can skip it in the query builder because I don't have data type information for both the source and target. I believe ConfigManager is where I have all of the information required to exclude a column from both source and target queries.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! Based on your discussion during the sync call, is this comment going to be resolved as-is or there will be code changes related to it?

@helensilva14
Copy link
Collaborator

All good from my side and only @sundar-mudupalli-work's comment is left to be resolved.

/gcbrun

@helensilva14
Copy link
Collaborator

@nj1973 - CLA check is failing because of my former Google email not being found anymore, sorry about that. I will need to go through the commits I did with the corp email and remove it from the authors

nj1973 and others added 12 commits December 2, 2024 12:50
* tests: Check Oracle integration tests with Python 3.11
…1340)

* feat: Support customer defined api endpoint for BigQuery and Spanner

* feat: Support customer defined api endpoint for BigQuery result handler

* Update data_validation/result_handlers/bigquery.py

Co-authored-by: Helen Cristina <[email protected]>

* chore: PR comments

* chore: Reformat

---------

Co-authored-by: Helen Cristina <[email protected]>
* Create SECURITY.md

* chore: Switched SECURITY.md to Google recommended content

---------

Co-authored-by: nj1973 <[email protected]>
* Create CODE_OF_CONDUCT.md

* Update CODE_OF_CONDUCT.md
nj1973 and others added 2 commits December 2, 2024 12:56
* chore: Add sample files using DVT for object reconcilliation

* Update samples/object_comparisons/README.md

Co-authored-by: Helen Cristina <[email protected]>

* Update samples/object_comparisons/reconcile.sh

Co-authored-by: Helen Cristina <[email protected]>

---------

Co-authored-by: Helen Cristina <[email protected]>
* docs: Refine connection secret support section
@helensilva14 helensilva14 force-pushed the 1335-postgresql-jsonb-data-type-support branch from 0fe64c4 to b38dc87 Compare December 2, 2024 18:10
@pull-request-size pull-request-size bot added size/XL and removed size/L labels Dec 2, 2024
@pull-request-size pull-request-size bot added size/L and removed size/XL labels Dec 2, 2024
@helensilva14
Copy link
Collaborator

/gcbrun

@sundar-mudupalli-work
Copy link
Collaborator

Neil,

This PR is a bit confusing. I thought it was ensuring that exceptions were not thrown when CLOBs and BLOBs are row and column validated. However I find that column validation with clobs and blobs works well. Column validations correctly count the number of non null columns I believe.

This PR removes (count) column validations of LOB and BLOB columns. I believe min, max etc are also supported. These should likely stay as they are today.

The issue is with row validations, which throw an error. It would be nice to fix this and put out an info message. This PR fixes this and only puts out an INFO message if the verbose option is selected. We also have another case where an INFO message is output (when verbose option is selected) when a column is removed. In both cases, I believe an INFO message should be output - not just when verbose option is selected. I also opened another Issue to remove the verbose option entirely.

My general philosophy on validations (please feel free to disagree) is that the customer is trusting DVT to tell them when two tables are different. If we are excluding items from the validation - we should tell the customer explicitly that these are being excluded. We should not quietly exclude things and only disclose information when the verbose option is used.

Thank you for working on this.

Sundar Mudupalli

@nj1973
Copy link
Contributor Author

nj1973 commented Dec 3, 2024

I find that column validation with clobs and blobs works well. Column validations correctly count the number of non null columns I believe.

I retested the test case in the issue (#1335) and I still get the same exception:

sqlalchemy.exc.DatabaseError: (cx_Oracle.DatabaseError) ORA-00932: inconsistent datatypes: expected - got CLOB

BLOB columns will not have thrown an exception in the past because they are skilled due to this code (which I have replaced in this PR):

                if agg_type == "count":
                    # Oracle BLOB is invalid for use with SQL COUNT function.
                    return bool(
                        self.source_client.name == "oracle"
                        or self.target_client.name == "oracle"
                    )
                else:

@nj1973
Copy link
Contributor Author

nj1973 commented Dec 3, 2024

/gcbrun

@nj1973
Copy link
Contributor Author

nj1973 commented Dec 4, 2024

Closing this PR because it is based on a misunderstanding of current behaviour. I'll start over and maybe cherry pick any parts of these changes I need for second attempt.

@nj1973 nj1973 closed this Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prevent column validation exceptions caused by Oracle CLOB/NCLOB columns used for JSON
3 participants