
[Tracking] Callbacks, Cost Modelling, and MLFlow integration #10

Merged: 40 commits into main from the callbacks branch, Dec 19, 2024

Conversation


@athewsey athewsey commented Dec 11, 2024

Issue #, if available: closes #8, closes #9

Description of changes:

This PR introduces a general callbacks mechanism to LLMeter, and implements two new features using it: Cost modelling, and MLFlow experiment tracking.

A callback (similar to the concept in HF Transformers) is a class that may implement any of several lifecycle hooks that LLMeter calls during a test Run. MlflowCallback uses only the after_run() hook to log experiment details to the active MLFlow environment, while CostModel also uses before_run() and request-level hooks to estimate request-level and run-level costs. Since pricing is generally complicated for a wide variety of reasons (including volume discounts, private pricing, capacity reservations, and so on), our cost model solution prioritizes giving users flexible tools to define their own models of request- and run-level costs - with space to provide automatic, end-to-end estimates in future (llmeter.callbacks.cost.providers), but caution about implementing them.
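A minimal sketch of the mechanism (hook names are taken from this PR's description; the exact base-class signatures are assumptions here):

```python
from llmeter.callbacks import Callback  # `Callback` is hoisted to the package level in this PR


class PrintingCallback(Callback):
    """Hypothetical callback that just logs run lifecycle events."""

    async def before_run(self, run) -> None:
        # Called once before the test Run starts (receives the individual
        # run's configuration - the `_Run` object)
        print(f"Starting run: {run}")

    async def after_run(self, result) -> None:
        # Called once after the Run completes - the hook MlflowCallback uses
        print(f"Run finished: {result}")
```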

BREAKING CHANGES:

  • payload (and all other arguments) of Runner.run() must now be passed as keyword arguments (i.e. payload=...). This is to avoid potential reliance on argument ordering, because run() can override all attributes of Runner (a large number) and there's nothing intrinsically special about payload coming first in this context. See the illustration below.
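A minimal illustration of the new calling convention (`runner` and `my_payload` are placeholder names):

```python
runner.run(payload=my_payload)  # OK: keyword argument
runner.run(my_payload)          # TypeError: positional payload is no longer accepted
```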

Docs & tests checklist

  • Example notebook added for cost modelling
  • Example notebook added for MLFlow tracking
  • Main README updated if needed
  • Decent test coverage for cost modelling
  • Decent test coverage for MLFlow
  • Decent test coverage for core callback mechanism

Exclusions for future work

  • End-to-end de/serialization of callbacks: Pending wider review of de/serialization in LLMeter
  • Auto-retrieval of Bedrock LLM prices: AWS Price List API seems to only have partial information for Claude (input but not output tokens). Marketplace catalog discovery API is restricted. Requiring manual price entry for now seems cleaner than only having a partial solution.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

acere and others added 30 commits November 25, 2024 18:58
- Introduce new callback module with base and cost submodules
- Add MLflow integration callback
- added nested option for MlflowCallback
- revised typing for base Callback class
- added callback mechanism to `LoadTest`
- refactored callback mechanism to `Runner` and `run()`
- added run start and end time to the run stats
Add `path` argument to base `Callback.save_to_file`. Make base
`Callback.load_from_file` final, to indicate that overriding it won't
do users any good. Take an initial stab at docstrings for the class.
Also fix broken test from last commit whoops
Add token granularity feature to CostPer*Token dimensions, and update
their `rate`->`rate_per_million` to better align with common orders
of magnitude
CostModel was not `await`ing the results of cost dimension
`calculate` coroutines, yielding unusable outputs.
Switch partial import circular dependency resolution around: In
general, callbacks depends on the rest of LLMeter and `Runner`'s
annotation of `Callback` types is the exception. This way around
should greatly reduce the number of places we have to put
`if TYPE_CHECKING` switches.
Flatten cost results onto Result & InvocationResponse classes so they
can be picked up by stats. Refactor cost model dimensions from list
to a map keyed by name (which needed a small adjustment to serde
`to_dict_recursive_generic` to accommodate dicts)
- Add cost calculation dimensions and modeling
- Integrate cost tracking with base endpoint
- Update runner and results to handle cost metrics
- Add test coverage for cost dimensions and model
Cost Callback now handles generating result-level summary stats for
response-level cost elements
Update cost callback model and results implementation
Modify core results functionality in llmeter package to enable callbacks to contribute custom stats
MLflow Callback Changes:
- Added automatic naming for nested runs

Results Processing:
Added new private method `_update_contributed_stats()` to be used by callbacks to contribute stats to `result.stats`.

Dependencies:
- added mlflow as extra dependency
Push Result.stats-style formatting down into
`CalculatedCostWithDimensions.summary_statistics` so the extra layer
of transformation in CostModel can be removed. Also extend its
`.{load_from|save_to}_namespace()` to support dicts, to further
simplify stats preparation logic in CostModel.
Update CostModel to use `Result._update_contributed_stats` API
instead of manipulating the private `_contributed_stats` dict.
Replace previous ternary `include_request_costs` option in CostModel
`calculate_run_cost()` with a boolean `recalculate_request_costs`.
The case of *excluding* request-level costs from the run-level output
is no longer supported - as we expect it to be niche and potentially
confusing.
Duplicate dimension name checking in CostModel constructor was not
properly detecting duplicate dimension names within one dimension
type (run vs request).
Replace placeholders in MlflowCallback and Callback.load_from_file
with NotImplementedErrors specifying that they're still TODO. Hoist
`Callback` to `callbacks.__init__` level. Misc docstring and license
header updates
Start to split out Runner (for a group of runs) from Run (for a
single specific run). Includes some updates for tests, but many are
still broken. Run.output_path for now is still referring to the
overall group's output path, not the individual Run's.
Fix broken tests with _Run+Runner refactor. Runner.run() now properly
awaits the returned _Run._run() coroutine. _Run now saves the run
config and doesn't throw error when output_path is None. Several unit
tests switched back from _Run to Runner, because they're testing
payload configuration which is set at the Runner level.
Change `_Run.output_path` to refer to the specific run's output
folder (i.e. no need to apply run name as suffix), while
`Runner.output_path` continues to refer to the overall run group's
output folder.
Automatic cost modelling for SageMaker Real-Time endpoints, including a
mention in the example notebook.
Fix experiments for recent change of Runner.run() accepting payload
only as a keyword argument.
Rename integ test_sagemaker.py to avoid cache issues
Update before_run base classes, docstrings, and examples to reflect
that it now receives the individual run's _RunConfig (actually the
_Run object) - not a "Runner" which is notably different.
Add tests that _Run calls all expected callback hooks. It looks like
before_invoke is actually not implemented yet!
Comment on lines 25 to 33
```python
async def before_invoke(self, payload: dict) -> None:
    """Lifecycle hook called before every `Endpoint.invoke()` request in a Run.

    Args:
        payload: The payload to be sent to the endpoint.

    Returns:
        None: If you'd like to modify the request `payload`, edit the dictionary in-place.
    """
    pass
```
Contributor (Author):

It looks like this particular hook is not actually implemented yet: see the recently added failing runner.py tests test_run_callback_triggered and test_run_callbacks_triggered_in_order.

I think the most natural place would be _Run._invoke_n_no_wait(), but:

  1. That's currently a sync method, which makes this tricky, and
  2. We probably also want to take a (deep?) copy of the payload before passing it through the chain of callbacks - since callbacks could in principle modify the payload, but a test run may loop through the same source payload multiple times (see the sketch below).
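A rough sketch of the copy-then-dispatch idea from point 2 (hypothetical helper names; not the actual _Run internals):

```python
from copy import deepcopy


async def _fire_before_invoke(callbacks: list, payload: dict) -> dict:
    # Deep-copy first, so callbacks can mutate their copy freely without
    # corrupting the shared source payload that later requests will reuse
    payload_copy = deepcopy(payload)
    for callback in callbacks:
        await callback.before_invoke(payload_copy)
    return payload_copy
```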

Contributor:

I've added a non-async implementation of before_invoke(), with a .copy() to protect the payload from side effects. I've added a comment in the code indicating how the current implementation might affect the timing test, and added a TODO for replacing the current sync method with two separate steps: payload pre-processing and endpoint invocation.

Contributor:

Evidently it did not work. I'm commenting out before_invoke(); I think it's better not to release it until the implementation is resolved.

llmeter/callbacks/mlflow.py: outdated review thread (resolved)
Comment on lines +54 to +58
```python
for key, value in stats.items():
    if not isinstance(value, Number):
        raise ValueError(
            f"Value for key {key} must be a number, got {type(value)}"
        )
```
Contributor (Author):

There's a limitation at the moment where (even though it's possible to re-analyze a result with a different CostModel and update the result using save=True), if the analysis changes what cost dimensions exist then there's no way for the cost model to find and delete the old ones.

For now I figure it's not a blocker, but we should probably warn about it somewhere? This restricted update_contributed_stats API is part of the reason for the limitation, but there'd be some extra work needed to fix it even if that were loosened up.

Contributor:

To be able to update or complement arbitrary modifications, we would need a comprehensive lineage/undo history - that sounds like overkill. A possible alternative approach: cost (and any additional custom record-level entries) could be forced to live in a separate file that references records by index. This would allow multiple cost analyses on the same initial data, and they would all coexist.
Thoughts?
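One illustrative shape for that side-file idea (hypothetical helper and file layout; not part of this PR):

```python
import json


def append_cost_records(path: str, cost_model: str, costs: list[tuple[int, dict]]) -> None:
    """Append per-record cost entries to a JSON-Lines side file.

    Each entry references the invocation it annotates by index, so multiple
    cost analyses can coexist alongside the same immutable run results.
    """
    with open(path, "a") as f:
        for index, cost in costs:
            f.write(json.dumps({"index": index, "cost_model": cost_model, **cost}) + "\n")
```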

Comment on lines +548 to +574
```python
Args:
    endpoint (Endpoint | dict | None): The LLM endpoint to be tested. **Must be set** at
        either the Runner or specific Run level.
    output_path (os.PathLike | str | None): The (cloud or local) base folder under which
        run outputs and configurations should be stored. By default, a new `run_name`
        sub-folder will be created under the Runner's `output_path` if set - otherwise
        outputs will not be saved to file.
    tokenizer (Tokenizer | Any | None): Optional tokenizer used to estimate input and
        output token counts for endpoints that don't report exact information.
    clients (int): The number of concurrent clients to use for sending requests.
    n_requests (int | None): The number of LLM invocations to generate *per client*.
    payload (dict | list[dict] | os.PathLike | str | None): The request data to send to the
        endpoint under test. You can provide a single JSON payload (dict), a list of
        payloads (list[dict]), or a path to one or more JSON/JSON-Lines files to be loaded
        by `llmeter.prompt_utils.load_payloads()`. **Must be set** at either the Runner or
        specific Run level.
    run_name (str | None): Name to use for a specific test Run. By default, runs are named
        with the date and time they're requested in format: `%Y%m%d-%H%M`
    run_description (str | None): A natural-language description for the test Run.
    timeout (int | float): The maximum time (in seconds) to wait for each response from the
        endpoint.
    callbacks (list[Callback] | None): Optional callbacks to enable during the test Run. See
        `llmeter.callbacks` for more information.
    disable_per_client_progress_bar (bool): Set `True` to disable per-client progress bars
        from showing during the run.
    disable_clients_progress_bar (bool): Set `True` to disable overall progress bar from
        showing during the run.
```
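For orientation, the documented arguments imply usage along these lines (a hedged sketch: the import path, the async run() signature, and the payload shape are assumptions, not confirmed by this diff):

```python
import asyncio

from llmeter.runner import Runner  # assumed import path


async def main():
    my_endpoint = ...  # placeholder: construct your Endpoint instance here
    runner = Runner(
        endpoint=my_endpoint,
        output_path="results",  # outputs go in a `run_name` sub-folder under here
        clients=4,              # 4 concurrent clients...
        n_requests=25,          # ...each sending 25 requests
        timeout=60,
    )
    # `payload` (like `endpoint`) must be set at the Runner or specific Run
    # level; its shape depends on the endpoint under test:
    result = await runner.run(payload={"prompt": "Hello!"}, run_name="smoke-test")


asyncio.run(main())
```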
Contributor (Author):

I looked at patterns for re-using chunks between docstrings (for example, HF Transformers makes extensive use of decorators to compose docstrings together) - but while these transformations get picked up by doc tools like Sphinx and the built-in help() function, PyLance/VSCode tooltips don't seem able to see them. So I left everything as vanilla for now, which means a bit of duplication and counter-intuitive placement in this file.

Contributor (Author):

I think these are our only tests that depend on an AWS API to run... but moto doesn't provide mocking for the Pricing API, so there's not much alternative for testing the SageMaker pricing auto-discovery.

I created this new tests/integ folder, and I'd suggest moving our existing tests under tests/unit or tests/local or similar, for clear separation - but I didn't do so yet because it would make the diffs painful to review.

Contributor:

After the initial callback release, let's move cost models and tests into a separate branch. I have the impression that complexity can balloon exponentially, and I'd rather keep it from spilling into the core functionality.

@acere acere marked this pull request as ready for review December 18, 2024 21:54
@acere acere merged commit cada484 into main Dec 19, 2024
@acere acere deleted the callbacks branch December 19, 2024 04:36