Wait replication-sync returns valid exit-code while CH connection error #231

Merged 1 commit on Sep 13, 2024

@MedvedewEM MedvedewEM commented Sep 6, 2024

When ClickHouse is stopped, the list_table_replicas(ctx) query also raises ConnectionError.
Let's catch these exceptions once, outside the for loop.
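
A minimal sketch of that idea, not the actual patch: the replica listing and the per-replica loop share one try/except, so a connection failure at either point ends with a non-zero exit code. `QueryConnectionError` stands in for `requests.exceptions.ConnectionError` to keep the example stdlib-only, and `wait_replication_sync`, `list_replicas` and `check_replica` are hypothetical names.

```python
import logging
import sys


class QueryConnectionError(Exception):
    """Stand-in for requests.exceptions.ConnectionError (assumption)."""


def wait_replication_sync(list_replicas, check_replica):
    """Check every replica; exit with code 1 if the server is unreachable.

    Both callables are hypothetical placeholders for the real queries.
    """
    try:
        # The listing query itself can fail when the server is stopped,
        # so it sits inside the same try block as the loop body.
        for replica in list_replicas():
            check_replica(replica)
    except QueryConnectionError:
        logging.error("Connection error while running query.")
        sys.exit(1)
```

With a single handler around the whole loop, the error path is the same whether the failure happens while listing replicas or while checking one of them.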

@MedvedewEM MedvedewEM force-pushed the wait_replication_sync_exit_code branch 13 times, most recently from 638d891 to 788584f Compare September 9, 2024 16:40
@MedvedewEM MedvedewEM force-pushed the wait_replication_sync_exit_code branch from 788584f to 1f7f848 Compare September 12, 2024 12:52
@MedvedewEM MedvedewEM marked this pull request as ready for review September 12, 2024 14:10
Comment on lines +118 to +120
except requests.exceptions.ConnectionError:
    logging.error("Connection error while running query.")
    sys.exit(1)

Why should we exit here? Isn't re-raising the exception plus logging.error, or raising a new exception with a custom message, enough?

Also, maybe add retries for requests.exceptions.ConnectionError? We can have short network issues.
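
For illustration, the alternative suggested above could look roughly like this: log, then raise a new exception with a custom message instead of exiting. This is a sketch under assumptions, with `QueryConnectionError` standing in for `requests.exceptions.ConnectionError`, and `ClickHouseUnavailableError` and `run_query` being hypothetical names.

```python
import logging


class QueryConnectionError(Exception):
    """Stand-in for requests.exceptions.ConnectionError (assumption)."""


class ClickHouseUnavailableError(RuntimeError):
    """Hypothetical domain-level error carrying a custom message."""


def run_query(execute):
    """Run a hypothetical `execute` callable, translating connection failures."""
    try:
        return execute()
    except QueryConnectionError as exc:
        logging.error("Connection error while running query.")
        # `from exc` keeps the original error reachable as __cause__.
        raise ClickHouseUnavailableError("ClickHouse is not reachable") from exc
```

Compared with sys.exit(1), this leaves the decision about the exit code to the caller, which can matter when the function is reused outside the CLI.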

@MedvedewEM MedvedewEM Sep 13, 2024

> Why should we exit here? Isn't re-raising the exception plus logging.error, or raising a new exception with a custom message, enough?

That is a good point. What do you think about fixing the problem globally here: https://github.com/yandex/ch-tools/blob/main/ch_tools/chadmin/cli/chadmin_group.py#L54
We could just add a raise in the except branch. I'm afraid that is the root problem, and not only for wait replication-sync.
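
The linked handler is not reproduced here, so the following is only a hypothetical shape of such a shared entry point. It contrasts swallowing an exception after logging it (the caller sees a normal return) with re-raising it (the failure propagates). The `invoke` function and its `reraise` flag are illustrative assumptions, not ch-tools code.

```python
import logging


def invoke(command, reraise=True):
    """Run a hypothetical CLI command callable through one shared handler."""
    try:
        return command()
    except Exception:
        logging.exception("Command failed")
        if reraise:
            # Propagating the error lets the process end with a non-zero code.
            raise
        # Without the re-raise the failure is swallowed here and the
        # caller sees a normal return.
        return None
```

An uncaught exception makes the Python interpreter exit with a non-zero status, so a bare raise in a top-level handler would be enough to surface the failure to the shell.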

> Also, maybe add retries for requests.exceptions.ConnectionError? We can have short network issues.

It is already here - https://github.com/yandex/ch-tools/blob/main/ch_tools/common/clickhouse/client/clickhouse_client.py#L165
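
For readers unfamiliar with the pattern, a generic retry loop with exponential backoff might look like the sketch below. It is not the implementation in the linked client, which may work differently. `QueryConnectionError` again stands in for `requests.exceptions.ConnectionError`, and `with_retries` and its parameters are hypothetical.

```python
import time


class QueryConnectionError(Exception):
    """Stand-in for requests.exceptions.ConnectionError (assumption)."""


def with_retries(func, attempts=3, delay=0.1, sleep=time.sleep):
    """Call `func`, retrying on QueryConnectionError up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except QueryConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff: delay, 2*delay, 4*delay, ...
            sleep(delay * 2 ** (attempt - 1))
```

Retries of this kind only mask short network issues. If the server is stopped, every attempt fails and the last error still reaches the caller.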

@MikhailBurdukov MikhailBurdukov Sep 13, 2024

Yeah, we need to review all such emergency exits; we've got a lot of them:

./ch_tools/s3_credentials/main.py:        sys.exit(1)
./ch_tools/s3_credentials/main.py:        sys.exit(1)
./ch_tools/s3_credentials/main.py:    sys.exit(0)
./ch_tools/monrun_checks/main.py:            sys.exit(1)
./ch_tools/monrun_checks/main.py:                sys.exit(1)
./ch_tools/chadmin/cli/wait_group.py:            sys.exit(1)
./ch_tools/chadmin/cli/wait_group.py:                sys.exit(1)
./ch_tools/chadmin/cli/wait_group.py:            sys.exit(0)
./ch_tools/chadmin/cli/wait_group.py:    sys.exit(1)
./ch_tools/chadmin/cli/wait_group.py:        sys.exit(0)
./ch_tools/chadmin/cli/wait_group.py:    sys.exit(1)
./ch_tools/chadmin/cli/zookeeper_group.py:    sys.exit(1)

Let's deal with it in a separate PR.
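
As a possible aid for that follow-up, a listing like the one above can also be produced from the syntax tree, which skips matches inside comments and strings. This is a sketch only, and `find_sys_exit_calls` is a hypothetical helper.

```python
import ast


def find_sys_exit_calls(source):
    """Return the line numbers of `sys.exit(...)` calls in Python source."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "exit"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "sys"
        ):
            lines.append(node.lineno)
    return sorted(lines)
```

It only recognizes the literal `sys.exit(...)` form. Calls made through an alias, such as `from sys import exit`, would need extra handling.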

> It is already here

Thanks, missed it.

@MikhailBurdukov MikhailBurdukov merged commit 0e66e24 into main Sep 13, 2024
23 checks passed
@MikhailBurdukov MikhailBurdukov deleted the wait_replication_sync_exit_code branch September 13, 2024 12:52

MedvedewEM commented Sep 17, 2024

> @MedvedewEM could you look into this? https://github.com/yandex/ch-tools/actions/runs/10849461150/job/30108748612 Seems related.

That is weird. The second retry helped.
It could be a flap, but right now I have no clue why stopping fetches on node1 and calling chadmin wait replication-lag on the same node1 would raise different timeout exceptions (http.ReadTimeout vs ClickHouse.Timeout).
It should be the same exception on each run...

Let's watch for this potential flap in upcoming PRs.
