Issues with custom bigquery api endpoint #1369

Open
nj1973 opened this issue Dec 11, 2024 · 2 comments

nj1973 commented Dec 11, 2024

In issue #1337 we added support for customers to define a custom endpoint. The requesting customer tested this and confirmed it was working; however, as they continued their testing they ran into further issues.

Example (sanitised) command:

$ data-validation -v --log-level DEBUG validate column -sc bq_conn -tc bq_conn \
-tbls OWN.TAB=OWN.TAB --filters 'TS_COL > TIMESTAMP(CURRENT_DATE())'
...
11/26/2024 01:23:36 PM-DEBUG: Starting new HTTPS connection (1): bigquery-priv.p.googleapis.com:443
11/26/2024 01:23:36 PM-DEBUG: https://bigquery-priv.p.googleapis.com:443 "GET /bigquery/v2/projects/proj/datasets/OWN/tables/TAB?prettyPrint=false HTTP/11" 200 None
11/26/2024 01:23:36 PM-DEBUG: https://bigquery-priv.p.googleapis.com:443 "GET /bigquery/v2/projects/proj/datasets/OWN/tables/TAB?prettyPrint=false HTTP/11" 200 None
11/26/2024 01:23:36 PM-INFO: {'data_client': <ibis.backends.bigquery.Backend object at 0x7f4a5d49da90>, 'schema_name': 'OWN', 'table_name': 'TAB', 'source_query': None}
11/26/2024 01:23:36 PM-INFO: -- ** Source Query ** --
11/26/2024 01:23:36 PM-INFO: SELECT count(1) AS `count`
FROM `proj.OWN.TAB` t0
WHERE TS_COL > TIMESTAMP(CURRENT_DATE())
11/26/2024 01:23:36 PM-DEBUG: Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)

... same as above but for target connection ...

11/26/2024 01:24:03 PM-DEBUG: Retrying due to 503 failed to connect to all addresses; last error: UNKNOWN: ipv4:172.x.y.z:443: Failed to connect to remote host: Timeout occurred: FD Shutdown, sleeping 0.0s ...
11/26/2024 01:24:03 PM-DEBUG: Retrying due to 503 failed to connect to all addresses; last error: UNKNOWN: ipv4:142.x.y.z:443: Failed to connect to remote host: Timeout occurred: FD Shutdown, sleeping 0.1s ...
11/26/2024 01:24:03 PM-DEBUG: Retrying due to 503 failed to connect to all addresses; last error: UNKNOWN: ipv4:172.x.y.z:443: Failed to connect to remote host: Timeout occurred: FD Shutdown, sleeping 0.0s …

We can see some successful interactions with the correct private endpoint:

11/26/2024 01:23:36 PM-DEBUG: https://bigquery-priv.p.googleapis.com:443 "GET /bigquery/v2/projects/proj/datasets/OWN/tables/TAB?prettyPrint=false HTTP/11" 200 None

But DVT then starts trying to access blocked IPs when executing the query.
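
To make the split between the two kinds of traffic easier to see in a long debug log, a small script can tally which hosts appear. This is only a sketch: the regular expressions assume the two line shapes shown in the log above (urllib3 lines starting with `https://host:port` and gRPC retry lines containing `ipv4:addr:port`), and `hosts_contacted` is a hypothetical helper name, not part of DVT.

```python
import re
from collections import Counter

# Assumed line shapes, taken from the debug output in this issue:
#   urllib3:  https://bigquery-priv.p.googleapis.com:443 "GET ..." 200 None
#   gRPC:     ... last error: UNKNOWN: ipv4:172.x.y.z:443: Failed to connect ...
HTTPS_HOST = re.compile(r'https://([^\s:/"]+):\d+')
GRPC_TARGET = re.compile(r"ipv4:([^\s:]+):\d+")


def hosts_contacted(log_text: str) -> Counter:
    """Count (transport, host) pairs seen in a DVT debug log."""
    counts = Counter()
    for line in log_text.splitlines():
        for match in HTTPS_HOST.finditer(line):
            counts[("https", match.group(1))] += 1
        for match in GRPC_TARGET.finditer(line):
            counts[("grpc", match.group(1))] += 1
    return counts
```

Any `grpc` entries that do not belong to the private endpoint suggest a client that is still using its default address.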

@nj1973 nj1973 self-assigned this Dec 11, 2024

nj1973 commented Dec 11, 2024

Interestingly this query works:

$ data-validation -v --log-level DEBUG query --conn bq_conn --query 'SELECT count(1) AS `count` FROM `proj.O_TEST.T1` t0'
...
[(1,)]

As does this:

$ data-validation -v --log-level DEBUG query --conn bq_conn --query 'SELECT count(1) AS `count` FROM `proj.OWN.TAB` where TS_COL > TIMESTAMP(CURRENT_DATE()) '
...
[(349893,)]

The second query above is the same query that DVT generated in the failing command. The only difference appears to be that one is run from data-validation query and the other from data-validation validate column.

@helensilva14 helensilva14 added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Dec 11, 2024

nj1973 commented Dec 12, 2024

We are missing an override for the BigQuery Storage API endpoint.

data-validation query does not use the Storage API. With data-validation validate column we pass through some Ibis to_arrow code, which sends us down a different path that interacts with the Storage API.

In Ibis v6 and upwards we have the option of passing in a BigQuery client and a BigQuery Storage API client, which has two benefits:

  1. We would be able to easily fix this problem
  2. We could avoid the monkey patching we do now for the BigQuery client which ends up making two connections, the standard Ibis one and then our custom one which overrides the original.
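
For illustration, the two clients could be built with explicit endpoints roughly as follows. This is a sketch under assumptions, not the DVT implementation: `grpc_target` and `build_clients` are hypothetical helper names, and the Storage API endpoint hostname is something the customer would have to supply (none appears in this issue). The Storage read client uses gRPC, so the sketch assumes its endpoint should be passed as a bare `host[:port]` without a URL scheme.

```python
from urllib.parse import urlparse


def grpc_target(endpoint: str) -> str:
    """Return host[:port] for a gRPC client, dropping any URL scheme."""
    if "://" not in endpoint:
        return endpoint
    return urlparse(endpoint).netloc


def build_clients(project: str, rest_endpoint: str, storage_endpoint: str):
    """Create BigQuery and BigQuery Storage read clients on custom endpoints.

    Both endpoint values are caller-supplied assumptions.
    """
    # Imported lazily so grpc_target() is usable without the Google libraries.
    from google.api_core.client_options import ClientOptions
    from google.cloud import bigquery, bigquery_storage_v1

    bq_client = bigquery.Client(
        project=project,
        client_options=ClientOptions(api_endpoint=rest_endpoint),
    )
    storage_client = bigquery_storage_v1.BigQueryReadClient(
        client_options=ClientOptions(api_endpoint=grpc_target(storage_endpoint)),
    )
    # With Ibis v6+, both objects could then be handed to the BigQuery
    # backend's connect call (keyword names not shown here; check the Ibis docs).
    return bq_client, storage_client
```

Calling `build_clients` requires valid Google credentials in the environment; `grpc_target` on its own has no such dependency.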

Obviously upgrading Ibis is non-trivial, so I need to consider other options.
