Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait replication-sync returns valid exit-code while CH connection error #231

Merged
merged 1 commit into from
Sep 13, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 17 additions & 14 deletions ch_tools/chadmin/cli/wait_group.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,8 @@
import sys
import time

import requests
from click import FloatRange, group, option, pass_context
from requests.exceptions import ReadTimeout

from ch_tools.chadmin.cli.chadmin_group import Chadmin
from ch_tools.chadmin.internal.clickhouse_disks import S3_METADATA_STORE_PATH
Expand Down Expand Up @@ -98,28 +98,31 @@ def wait_replication_sync_command(
start_time = time.time()
deadline = start_time + total_timeout.total_seconds()

# Sync tables in cycle
for replica in list_table_replicas(ctx):
full_name = f"`{replica['database']}`.`{replica['table']}`"
time_left = deadline - time.time()
timeout = min(replica_timeout.total_seconds(), time_left)
try:
# Sync tables in cycle
for replica in list_table_replicas(ctx):
full_name = f"`{replica['database']}`.`{replica['table']}`"
time_left = deadline - time.time()
timeout = min(replica_timeout.total_seconds(), time_left)

try:
execute_query(
ctx,
f"SYSTEM SYNC REPLICA {full_name}",
format_=None,
timeout=timeout,
settings={"receive_timeout": timeout},
)
except ReadTimeout:
logging.error("Timeout while running SYNC REPLICA on {}.", full_name)
except requests.exceptions.ReadTimeout:
logging.error("Read timeout while running query.")
sys.exit(1)
except requests.exceptions.ConnectionError:
logging.error("Connection error while running query.")
sys.exit(1)
Comment on lines +118 to +120
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we should exit here? Is reraise the exception + logging.error or raise new exception with custom msg not enough?

Also maybe for requests.exceptions.ConnectionError add retries? Because we can have some short network issues

Copy link
Contributor Author

@MedvedewEM MedvedewEM Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we should exit here? Is reraise the exception + logging.error or raise new exception with custom msg not enough?

That is good point - what do you think about fix problem globally here - https://github.com/yandex/ch-tools/blob/main/ch_tools/chadmin/cli/chadmin_group.py#L54
Just adding raise in except branch. I afraid it is initial problem. And not only for wait replication-sync.

Also maybe for requests.exceptions.ConnectionError add retries? Because we can have some short network issues

It is already here - https://github.com/yandex/ch-tools/blob/main/ch_tools/common/clickhouse/client/clickhouse_client.py#L165

Copy link
Contributor

@MikhailBurdukov MikhailBurdukov Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we need to review all such emergency exists, we got a lot of them:

./ch_tools/s3_credentials/main.py:        sys.exit(1)
./ch_tools/s3_credentials/main.py:        sys.exit(1)
./ch_tools/s3_credentials/main.py:    sys.exit(0)
./ch_tools/monrun_checks/main.py:            sys.exit(1)
./ch_tools/monrun_checks/main.py:                sys.exit(1)
./ch_tools/chadmin/cli/wait_group.py:            sys.exit(1)
./ch_tools/chadmin/cli/wait_group.py:                sys.exit(1)
./ch_tools/chadmin/cli/wait_group.py:            sys.exit(0)
./ch_tools/chadmin/cli/wait_group.py:    sys.exit(1)
./ch_tools/chadmin/cli/wait_group.py:        sys.exit(0)
./ch_tools/chadmin/cli/wait_group.py:    sys.exit(1)
./ch_tools/chadmin/cli/zookeeper_group.py:    sys.exit(1)

Let's deal with it in the separate pr

It is already here

Thanks, missed it.

except ClickhouseError as e:
if "TIMEOUT_EXCEEDED" in str(e):
logging.error("Timeout while running query.")
sys.exit(1)
except ClickhouseError as e:
if "TIMEOUT_EXCEEDED" in str(e):
logging.error("Timeout while running SYNC REPLICA on {}.", full_name)
sys.exit(1)
raise
raise

# Replication lag
while time.time() < deadline:
Expand Down
18 changes: 17 additions & 1 deletion tests/features/chadmin.feature
Original file line number Diff line number Diff line change
Expand Up @@ -145,14 +145,30 @@ Feature: chadmin commands.
"""
Then it fails with response contains
"""
Timeout while running SYNC REPLICA on
Read timeout while running query.
"""
When we execute query on clickhouse01
"""
SYSTEM START FETCHES
"""
When we execute command on clickhouse01
"""
supervisorctl stop clickhouse-server
"""
When we try to execute command on clickhouse01
"""
chadmin wait replication-sync --total-timeout 10 --replica-timeout 3 -p 1 -w 4
"""
Then it fails with response contains
"""
Connection error while running query.
"""
When we execute command on clickhouse01
"""
supervisorctl start clickhouse-server
"""
When we execute command on clickhouse01
"""
chadmin wait replication-sync --total-timeout 10 --replica-timeout 3 -p 1 -w 4
"""

Expand Down
Loading