connect_retries and connect_timeout parameters don't have an effect #778

Open
henlue opened this issue Aug 27, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@henlue
Contributor

henlue commented Aug 27, 2024

Describe the bug

The connect_retries and connect_timeout parameters in profiles.yml don't have the effect described in the docs.

The retry functionality seems to be implemented, but the list of exceptions for which a retry happens is empty by default (here and here). It is possible to configure the connector to retry on all exceptions by setting retry_all: true, which makes connect_retries and connect_timeout behave as documented, but the retry_all parameter is not documented; I only found it while reading the code.
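
For illustration, this is roughly what the workaround looks like in profiles.yml (a minimal sketch: the host/path/token placeholders and numeric values are just examples, and the comments assume the semantics described in the docs):

databricks:
  outputs:
    prod:
      type: databricks
      host: <workspace-host>
      http_path: <http-path>
      token: <token>
      schema: schema
      connect_retries: 10   # only takes effect once retry_all is set
      connect_timeout: 60   # seconds to wait between attempts (per the docs)
      retry_all: true       # undocumented: retry on all exceptions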

Depending on the desired behavior, I see several ways to fix this:

  1. Keep the existing behavior and update the documentation, for example by adding the retry_all parameter.
  2. Change the existing behavior to match the documentation. For example:
    a) Add transient exceptions to the retryable_exceptions list
    b) Set retry_all to true by default (though I'm not sure about the side effects)
    c) Forward the connect_retries and connect_timeout parameters to the Databricks SQL Connector, if possible

I would be willing to implement a fix or to take a deeper look into the implications of the various fixes I've described.

Steps To Reproduce

I've created a profiles.yml with invalid connection parameters and a high number of connect_retries:

databricks:
  outputs:
    test:
      type: databricks
      host: invalid
      http_path: invalid
      token: invalid
      schema: schema
      connect_retries: 1000

then executed dbt run.

Expected behavior

I expect 1000 retries. Instead, dbt tries to establish the connection for 15 minutes, as it does when connect_retries is set to 1, and then fails.

System information

The output of dbt --version:

Core:
  - installed: 1.8.4
  - latest:    1.8.5 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - spark:      1.8.0 - Up to date!
  - databricks: 1.8.3 - Update available!

The operating system you're using:
Ubuntu 22.04
The output of python --version:
Python 3.10.12

Additional context

We use a classic warehouse on Azure for our daily jobs. By default, dbt-databricks tries for 15 minutes to establish a connection to the warehouse, but sometimes this is not enough time for the warehouse to start.
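
For this scenario, the configuration we would like to be able to express looks roughly like the sketch below (illustrative values; assuming connect_timeout is the wait between attempts, 30 retries 60 seconds apart would keep trying for about 30 minutes instead of failing after 15):

databricks:
  outputs:
    prod:
      type: databricks
      # host, http_path, token, schema as usual
      connect_retries: 30   # enough attempts to cover slow warehouse startup
      connect_timeout: 60   # ~30 minutes of attempts in total
      retry_all: true       # currently required for the parameters above to apply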

henlue added the bug label Aug 27, 2024
@caineblood

The comment above asking you to download a file is malware intended to steal your account; do not, under any circumstances, download or run it. The post needs to be removed. If you have attempted to run it, please have your system cleaned and your account secured immediately.

@benc-db
Collaborator

benc-db commented Sep 12, 2024

Thanks for the report. Quite a few users have been asking about retries lately, so I think I'll need to look into it.

@Tonayya

Tonayya commented Oct 22, 2024

Hi @benc-db, a quick question on the connection retries: which types of connection failures would this functionality actually retry on? For example, if a connection fails due to cluster maintenance, would this retry the connection?

@benc-db
Collaborator

benc-db commented Oct 22, 2024

Connection retries apply to situations where the SQL gateway returns a 429 or 503, i.e., signals that it has not scheduled the request due to insufficient resources or being busy. I believe this covers your case of cluster maintenance, but I'm not 100% certain. You can ask on the SQL connector page for more details.
