feat: added client-side instrumentation to all rpcs #925

Open · wants to merge 26 commits into client_side_metrics_handlers

Conversation

@daniel-sanche (Contributor) commented Jan 26, 2024

This PR builds off of #924 to add instrumentation to each RPC in the Bigtable v3 data client.


This work required some changes to the gapic-generated client, which will need to be upstreamed to the gapic generator at some point: googleapis/gapic-generator-python#1856

TODO:

  • add end-to-end system tests
  • get a final benchmarking in place
  • improve _mutate_rows instrumentation
  • more tests?
    • mutations batcher
    • _read_rows and _mutate_rows operations
      • specifically: _MutateRowsIncomplete exception

@daniel-sanche daniel-sanche requested review from a team as code owners January 26, 2024 23:26
@product-auto-label product-auto-label bot added size: xl Pull request size is extra large. api: bigtable Issues related to the googleapis/python-bigtable API. labels Jan 26, 2024
del active_request_indices[result.index]
finally:
# send trailing metadata to metrics
result_generator.cancel()
Contributor:
why do we need to call cancel in the end?

daniel-sanche (Author):

This is calling cancel on the grpc stream. I think this is in case we encounter a client error; we need to stop the stream so it'll give us the trailing metadata
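
For context, a minimal sketch of the pattern being discussed (the wrapper and the add_response_metadata method name are hypothetical; result_generator is assumed to be a grpc.aio streaming call): the stream is cancelled in a finally block so trailing metadata is still surfaced to the metrics handler even when the consumer stops early.

async def _consume_with_metrics(result_generator, operation):
    try:
        async for result in result_generator:
            yield result
    finally:
        # stop the underlying gRPC stream; once the call terminates, the
        # trailing metadata becomes available and can be handed to metrics
        result_generator.cancel()
        operation.add_response_metadata(await result_generator.trailing_metadata())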

Contributor:

What client error would happen? I'm just wondering if this will introduce some bugs (like not consuming all the results or cancelling the grpc stream twice)

# the value directly to avoid extra overhead
operation.active_attempt.application_blocking_time_ms += ( # type: ignore
time.monotonic() - block_time
) * 1000
Contributor:

Can we record everything in nanoseconds and convert them to milliseconds instead?
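
A small sketch of that suggestion (the application_blocking_time_ns field name is hypothetical): accumulate in integer nanoseconds with time.monotonic_ns() and convert to milliseconds once, when the attempt is exported.

import time

block_start = time.monotonic_ns()
# ... application code runs while the client is not consuming the stream ...
operation.active_attempt.application_blocking_time_ns += (
    time.monotonic_ns() - block_start
)

# at export time, convert once:
blocking_ms = operation.active_attempt.application_blocking_time_ns / 1e6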

@@ -593,11 +595,20 @@ async def read_rows_stream(
)
retryable_excs = _get_retryable_errors(retryable_errors, self)

# extract metric operation if passed down through kwargs
# used so that read_row can disable is_streaming flag
metric_operation = kwargs.pop("metric_operation", None)
Contributor:

Is this the API customer will interact with? This feels a bit weird 🤔 I think if someone calls read_rows we can set is_streaming=True and OperationType to be ReadRows.

daniel-sanche (Author):

This is a hidden argument, users aren't supposed to interact with it

The problem is read_row is a small wrapper on top of read_rows. So read_rows can't assume the operation is streaming. This pattern is trying to allow read_row to pass down its own operation when it calls read_rows, so it can make sure streaming is False

We could also solve it using an entirely separate helper if this is too ugly though
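
A rough sketch of the pattern described above (the _query_for_single_row helper and the is_streaming keyword on create_operation are assumptions): read_row builds a non-streaming metric operation and hands it to read_rows through the hidden kwarg, so the shared path never has to guess whether the call is streaming.

async def read_row(self, row_key, **kwargs):
    # create the metric operation here so is_streaming can be set to False
    operation = self._metrics.create_operation(
        OperationType.READ_ROWS, is_streaming=False
    )
    results = await self.read_rows(
        self._query_for_single_row(row_key),
        metric_operation=operation,
        **kwargs,
    )
    return results[0] if results else None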

mutianf (Contributor) commented Feb 9, 2024:

users aren't supposed to interact with it

Is it possible to make sure users can't interact with it? The fact that someone could look at the source code and pass in a random string (which will increase the cardinality) is a little concerning.

@@ -328,17 +331,26 @@ async def _flush_internal(self, new_entries: list[RowMutationEntry]):
"""
# flush new entries
in_process_requests: list[asyncio.Future[list[FailedMutationEntryError]]] = []
metric = self._table._metrics.create_operation(OperationType.BULK_MUTATE_ROWS)
Contributor:

We don't have this OperationType in Java 😓 let's just use MUTATE_ROWS for this.

daniel-sanche (Author):

BULK_MUTATE_ROWS is just the name we've been using for mutate_rows, since it's easier to parse. The string value is still "MutateRows"
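
In other words (a sketch, not the actual enum definition; the other members and values shown are assumptions): the member name is Python-specific, but the value that ends up in the metric labels is the cross-language method name.

from enum import Enum

class OperationType(Enum):
    READ_ROWS = "ReadRows"
    MUTATE_ROW = "MutateRow"
    BULK_MUTATE_ROWS = "MutateRows"  # member name differs, exported value matches Java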

Contributor:

Gotcha!

timeout_val: float | None,
) -> tuple[Exception, Exception | None]:
exc, source = _retry_exception_factory(exc_list, reason, timeout_val)
if reason != RetryFailureReason.NON_RETRYABLE_ERROR:
Contributor:

should this be if reason == RetryFailureReason.NON_RETRYABLE_ERROR?

daniel-sanche (Author):

No, the wrapped_predicate will report all exceptions encountered. But the predicate isn't called when the operation ends due to timeout. So this extra wrapper is needed for that

This part is a bit more complicated than I hoped, so I was considering trying to refactor some of the read_rows instrumentation when I have a chance. Let me know what you think
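
For reference, a sketch of how such a wrapper could look (operation.end_with_status is a hypothetical method name): the retry predicate records every exception it sees during retries, but it is never invoked when the overall deadline expires, so the exception factory is wrapped to report that terminal failure to the metric operation.

def _make_metric_exception_factory(operation):
    def wrapped(exc_list, reason, timeout_val):
        exc, source = _retry_exception_factory(exc_list, reason, timeout_val)
        if reason != RetryFailureReason.NON_RETRYABLE_ERROR:
            # non-retryable errors were already reported by the wrapped predicate;
            # timeouts are only observable here, at the end of the operation
            operation.end_with_status(exc)
        return exc, source
    return wrapped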

Contributor:

I see. It makes sense but a little hard to follow. Maybe add a comment for now? And also in _read_rows start_operation comment what metric_fns[0] and metric_fns[1] are?

@daniel-sanche daniel-sanche force-pushed the client_side_metrics_instrumentation branch from fcbdda6 to ca0963a Compare February 9, 2024 02:01
@daniel-sanche daniel-sanche requested a review from a team as a code owner February 9, 2024 02:01

# For each row
while True:
try:
c = await it.__anext__()
except StopAsyncIteration:
# stream complete
operation.end_with_success()
Contributor:

does this happen when customer cancels the read in the middle of a stream?

