feat: client side metrics data model #923
Conversation
# by default, exceptions in the metrics system are logged,
# but enabling this flag causes them to be raised instead
ALLOW_METRIC_EXCEPTIONS = os.getenv("BIGTABLE_METRICS_EXCEPTIONS", False)
Yeah, I don't think we should ever break the client; the exporter should just run in the background and log errors if there are any
completed rpc attempt.
"""

start_time: datetime.datetime
Yes, these are operation-level labels and should be the same across multiple attempts. But attempts are also labeled with these fields, so I want to make sure they're added to the attributes later :)
new_attempt = CompletedAttemptMetric(
    start_time=self.active_attempt.start_time.utc,
    first_response_latency_ms=self.active_attempt.first_response_latency_ms,
    duration_ms=duration_seconds * 1000,
is it possible to measure it in nanoseconds? seconds precision seems too low :(
duration_seconds is actually higher precision than seconds already, because it's a float value.
The docs say this about the precision: "Use monotonic_ns() to avoid the precision loss caused by the float type."
So I think we should already be at sub-millisecond precision, but if that's not good enough we can change everything to monotonic_ns to get full int nanoseconds everywhere
gotcha. Let's use monotonic_ns and convert everything to milliseconds. The bucketing in OTEL is different from server side: OTEL buckets use (start, end] while the server uses [start, end). Recording everything in a float histogram can minimize these off-by-one errors.
Sure, just converted everything to int nanos
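For reference, a minimal sketch of the pattern being agreed on here (variable names are illustrative, not the PR's actual fields): timestamps are captured with time.monotonic_ns() as exact integers, and only converted to float milliseconds at the point where the value is recorded.

```python
import time

# capture an exact integer-nanosecond monotonic timestamp at the start
start_time_ns = time.monotonic_ns()

# ... the attempt runs ...

# elapsed time stays an exact int in nanoseconds
duration_ns = time.monotonic_ns() - start_time_ns

# convert to float milliseconds only at the edge, when handing the value
# to the OTEL histogram, so any rounding happens in one place
duration_ms = duration_ns / 1e6
```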
LGTM after nits
""" | ||
history = [] | ||
subgenerator = exponential_sleep_generator(initial, multiplier, maximum) | ||
while True: |
dumb question: when will it break out of the loop?
This is a Python generator function: it gives up control at each yield line. The idea is that you get an instance like `generator = backoff_generator(...)`, and then you call `next(generator)` or `generator.send(idx)` on it every time you want to retrieve a value. This runs the internal code until it reaches the next yield, then pauses execution again until the next time a value is requested.
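To illustrate the pause/resume behavior described above, here is a toy generator (not the PR's backoff_generator) showing how next() and send() drive execution:

```python
def counting_generator():
    """Toy generator: execution pauses at each yield until a value is requested."""
    n = 0
    while True:
        received = yield n   # pause here; resumes when next()/send() is called
        if received is not None:
            n = received     # send() can push a value back into the generator
        n += 1

gen = counting_generator()
print(next(gen))     # 0  -- runs up to the first yield
print(next(gen))     # 1  -- resumes after the yield, loops, pauses again
print(gen.send(10))  # 11 -- send(10) resumes the paused yield with the value 10
# the while True loop itself never breaks; the generator is simply
# discarded (and garbage collected) when it is no longer needed
```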
This seems a bit too surprising of an API. I think it would be a lot cleaner and easier to read if Attempt.start() took a delay parameter
# by default, exceptions in the metrics system are logged,
# but enabling this flag causes them to be raised instead
ALLOW_METRIC_EXCEPTIONS = os.getenv("BIGTABLE_METRICS_EXCEPTIONS", False)
Hmmm this still seems like an option? 😅 I think we should just remove this option
completed rpc attempt.
"""

start_time: datetime.datetime
gotcha, this makes sense! Can we also add this explanation to the document? Maybe something like "Operation-level fields can be accessed from `ActiveOperationMetric`"
def end_attempt_with_status(self, status: StatusCode | Exception) -> None:
    """
    Called to mark the end of a failed attempt for the operation.
should this comment be "Called to mark the end of an attempt for the operation."? It's also called in end_with_status, where the status could be OK
That's a good point. Usually users of this code won't call end_attempt_with_status after a successful attempt, because a successful attempt also means a successful operation. But it is used that way internally. I'll change this comment to try to make it more clear
preferred for calculations because it is resilient to clock changes, eg DST
"""

utc: datetime.datetime = field(
since we're only measuring latencies, why do we need the utc timestamp?
Hmm good point. I thought that the wall-time timestamp was important to collect, but maybe we don't need it. I pulled it out
)
if isinstance(status, Exception):
    status = self._exc_to_status(status)
new_attempt = CompletedAttemptMetric(
nit: should we rename `new_attempt` to `current_attempt`? `new_attempt` sounds like we're creating an object for the next attempt? 🤔
sure, renamed to completed_attempt
""" | ||
history = [] | ||
subgenerator = exponential_sleep_generator(initial, multiplier, maximum) | ||
while True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bit too surprising of an api. I think it would be a lot cleaner and easier to read if Attempt.start() took a delay parameter
)

# find backoff value
if self.backoff_generator and len(self.completed_attempts) > 0:
nit: `self.backoff_generator and self.completed_attempts`
""" | ||
|
||
op_type: OperationType | ||
backoff_generator: BackoffGenerator | None = None |
Might be good to add a comment explaining when this is None... I'm guessing it's for non-retriable operations?
Separately, I'm not sure this will work for RetryInfo in response trailers (where the server specifies how much to sleep). It might be better to just pass the amount slept as an arg to start_attempt
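For illustration, a rough sketch of the alternative being suggested here (all names are hypothetical, not the PR's API): the retry loop reports the delay it actually slept, which would also cover server-specified RetryInfo delays.

```python
import asyncio

async def retry_loop(operation_metric, sleep_generator, attempt_fn):
    """Hypothetical retry loop: the caller passes the slept backoff to start_attempt,
    instead of the metric object reading from a shared backoff generator."""
    delay = 0.0
    while True:
        # the amount slept before this attempt is reported explicitly,
        # so a RetryInfo-provided delay would be captured the same way
        operation_metric.start_attempt(backoff_before_attempt_ms=delay * 1000)
        try:
            return await attempt_fn()
        except Exception as exc:
            operation_metric.end_attempt_with_status(exc)
            delay = next(sleep_generator)
            await asyncio.sleep(delay)
```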
backoff_generator: BackoffGenerator | None = None
# keep monotonic timestamps for active operations
start_time_ns: int = field(default_factory=time.monotonic_ns)
active_attempt: ActiveAttemptMetric | None = None
When would this be non-None?
cluster_id: str | None = None
zone: str | None = None
completed_attempts: list[CompletedAttemptMetric] = field(default_factory=list)
is_streaming: bool = False  # only True for read_rows operations
That's not entirely true: it would also be true for CDC if we were to support that in this client, and we have a couple of other features that would set this to true. I would remove the comment so it doesn't get stale
""" | ||
Creates a new operation and registers it with the subscribed handlers. | ||
""" | ||
handlers = self.handlers + kwargs.pop("handlers", []) |
what's the use case for adding a handler per operation?
self, inner_predicate: Callable[[Exception], bool]
) -> Callable[[Exception], bool]:
    """
    Wrapps a predicate to include metrics tracking. Any call to the resulting predicate
nit: s/Wrapps/Wrap
I'm having a hard time wrapping my head around this... what's an example of a predicate that will be wrapped?
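For context, the predicates in question are retry checks such as google-api-core's if_exception_type(...), which are Callable[[Exception], bool]. A hedged guess at the shape of the wrapper (simplified; not necessarily what the PR does):

```python
from typing import Callable

def wrapped_predicate(
    operation_metric, inner_predicate: Callable[[Exception], bool]
) -> Callable[[Exception], bool]:
    """Sketch: whenever the retry machinery asks "should this exception be retried?",
    the exception is also recorded against the active attempt."""
    def wrapper(exc: Exception) -> bool:
        operation_metric.end_attempt_with_status(exc)  # record the failed attempt
        return inner_predicate(exc)
    return wrapper
```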
- exc: The exception to extract the status code from.
"""
if isinstance(exc, bt_exceptions._BigtableExceptionGroup):
    exc = exc.exceptions[-1]
Why the last one? please add a note
if (
    exc.__cause__
    and hasattr(exc.__cause__, "grpc_status_code")
    and exc.__cause__.grpc_status_code is not None
):
    return exc.__cause__.grpc_status_code
is a single level enough? should this be recursive?
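A small sketch of the chain-walking version being asked about (the helper name is hypothetical):

```python
def _status_from_cause_chain(exc: Exception):
    """Walk the __cause__ chain until an exception carrying a grpc status code is found."""
    while exc is not None:
        code = getattr(exc, "grpc_status_code", None)
        if code is not None:
            return code
        exc = exc.__cause__
    return None
```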
self,
fn: Callable[..., Any],
*,
extract_call_metadata: bool = True,
why would this be false?
extract_call_metadata: bool = True,
) -> Callable[..., Any]:
    """
    Wraps a function call, tracing metadata along the way
Does this wrap an attempt or an operation?
This PR adds the data model for the client-side metrics system.

Follow-up PRs:

Design

The main architecture looks like this:

Most of the work is done by the `ActiveOperationMetric` class, which is instantiated with each rpc call and updated through the lifecycle of the call. When the rpc is complete, it will call `on_operation_complete` and `on_attempt_complete` on the MetricsHandler, which can then log the completed data into OpenTelemetry (or, theoretically, other locations if needed).

Note that there are separate classes for active vs completed metrics (`ActiveOperationMetric`, `ActiveAttemptMetric`, `CompletedOperationMetric`, `CompletedAttemptMetric`). This is so that we can keep fields mutable and optional while the request is ongoing, but pass down static immutable copies once the attempt is completed and no new data is coming.
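To make that lifecycle concrete, here is a condensed sketch of the design described above (field names and signatures are simplified; the real classes carry many more fields and labels):

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CompletedAttemptMetric:
    # immutable snapshot handed to handlers once an attempt finishes
    duration_ms: float
    end_status: str


@dataclass
class ActiveOperationMetric:
    # mutable, optional fields while the rpc is in flight
    op_type: str
    handlers: list = field(default_factory=list)
    completed_attempts: list = field(default_factory=list)
    attempt_start_ns: int | None = None

    def start_attempt(self) -> None:
        self.attempt_start_ns = time.monotonic_ns()

    def end_attempt_with_status(self, status: str) -> None:
        attempt = CompletedAttemptMetric(
            duration_ms=(time.monotonic_ns() - self.attempt_start_ns) / 1e6,
            end_status=status,
        )
        self.completed_attempts.append(attempt)
        for handler in self.handlers:
            handler.on_attempt_complete(attempt, self)

    def end_with_status(self, status: str) -> None:
        # a finished operation also closes out its final attempt,
        # even when that attempt (and the operation) succeeded
        self.end_attempt_with_status(status)
        for handler in self.handlers:
            handler.on_operation_complete(status, self)
```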