
[Tracking] Callbacks, Cost Modelling, and MLFlow integration #10

Merged: 40 commits into main from the callbacks branch, Dec 19, 2024

Conversation


@athewsey athewsey commented Dec 11, 2024

Issue #, if available: closes #8, closes #9

Description of changes:

This PR introduces a general callbacks mechanism to LLMeter, and implements two new features using it: Cost modelling, and MLFlow experiment tracking.

A callback (similar to the concept in HF Transformers) is a class that may implement any of several lifecycle hooks that LLMeter calls during a test Run. MlflowCallback uses only the after_run() hook to log experiment details to the active MLFlow environment, while CostModel also uses before_run() and request-level hooks to estimate request-level and run-level costs. Since pricing is generally complicated for a wide variety of reasons (including volume discounts, private pricing, capacity reservations, and so on), our cost model solution prioritizes giving users flexible tools to define their own models of request- and run-level costs - with space to provide automatic, end-to-end estimates in future (llmeter.callbacks.cost.providers), but caution about implementing them.
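A minimal sketch of the mechanism (hook names are taken from this PR's description; the exact base-class signatures are assumptions here):

```python
from llmeter.callbacks import Callback  # `Callback` is hoisted to the package level in this PR


class PrintingCallback(Callback):
    """Hypothetical callback that just logs run lifecycle events."""

    async def before_run(self, run) -> None:
        # Called once before the test Run starts (receives the individual
        # run's configuration - the `_Run` object)
        print(f"Starting run: {run}")

    async def after_run(self, result) -> None:
        # Called once after the Run completes - the hook MlflowCallback uses
        print(f"Run finished: {result}")
```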

BREAKING CHANGES:

  • payload (and all other arguments) of Runner.run() must now be passed as keyword arguments (i.e. payload=...). This is to avoid potential reliance on argument ordering, because run() can override all attributes of Runner (a large number) and there's nothing intrinsically special about payload coming first in this context. See the illustration below.
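A minimal illustration of the new calling convention (`runner` and `my_payload` are placeholder names):

```python
runner.run(payload=my_payload)  # OK: keyword argument
runner.run(my_payload)          # TypeError: positional payload is no longer accepted
```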

Docs & tests checklist

  • Example notebook added for cost modelling
  • Example notebook added for MLFlow tracking
  • Main README updated if needed
  • Decent test coverage for cost modelling
  • Decent test coverage for MLFlow
  • Decent test coverage for core callback mechanism

Exclusions for future work

  • End-to-end de/serialization of callbacks: Pending wider review of de/serialization in LLMeter
  • Auto-retrieval of Bedrock LLM prices: AWS Price List API seems to only have partial information for Claude (input but not output tokens). Marketplace catalog discovery API is restricted. Requiring manual price entry for now seems cleaner than only having a partial solution.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

acere and others added 30 commits November 25, 2024 18:58
- Introduce new callback module with base and cost submodules
- Add MLflow integration callback
- added nested option for MlflowCallback
- revised typing for base Callback class
- added callback mechanism to `LoadTest`
- refactored callback mechanism to `Runner` and `run()`
- added run start and end time to the run stats
Add `path` argument to base `Callback.save_to_file`. Make base
`Callback.load_from_file` final, to indicate that overriding it won't
do users any good. Take an initial stab at docstrings for the class.
Also fix broken test from last commit whoops
Add token granularity feature to CostPer*Token dimensions, and update
their `rate`->`rate_per_million` to better align with common orders
of magnitude
CostModel was not `await`ing the results of cost dimension
`calculate` coroutines, yielding unusable outputs.
Switch partial import circular dependency resolution around: In
general, callbacks depends on the rest of LLMeter and `Runner`'s
annotation of `Callback` types is the exception. This way around
should greatly reduce the number of places we have to put
`if TYPE_CHECKING` switches.
Flatten cost results onto Result & InvocationResponse classes so they
can be picked up by stats. Refactor cost model dimensions from list
to a map keyed by name (which needed a small adjustment to serde
`to_dict_recursive_generic` to accommodate dicts)
- Add cost calculation dimensions and modeling
- Integrate cost tracking with base endpoint
- Update runner and results to handle cost metrics
- Add test coverage for cost dimensions and model
Cost Callback now handles generating result-level summary stats for
response-level cost elements
Update cost callback model and results implementation
Modify core results functionality in llmeter package to enable callbacks to contribute custom stats
MLflow Callback Changes:
- Added automatic naming for nested runs

Results Processing:
Added new private method `_update_contributed_stats()` to be used by callbacks to contribute stats to `result.stats`.

Dependencies:
- added mlflow as extra dependency
Push Result.stats-style formatting down into
`CalculatedCostWithDimensions.summary_statistics` so the extra layer
of transformation in CostModel can be removed. Also extend its
`.{load_from|save_to}_namespace()` to support dicts, to further
simplify stats preparation logic in CostModel.
Update CostModel to use `Result._update_contributed_stats` API
instead of manipulating the private `_contributed_stats` dict.
Replace previous ternary `include_request_costs` option in CostModel
`calculate_run_cost()` with a boolean `recalculate_request_costs`.
The case of *excluding* request-level costs from the run-level output
is no longer supported - as we expect it to be niche and potentially
confusing.
Duplicate dimension name checking in CostModel constructor was not
properly detecting duplicate dimension names within one dimension
type (run vs request).
Replace placeholders in MlflowCallback and Callback.load_from_file
with NotImplementedErrors specifying that they're still TODO. Hoist
`Callback` to `callbacks.__init__` level. Misc docstring and license
header updates
Start to split out Runner (for a group of runs) from Run (for a
single specific run). Includes some updates for tests, but many are
still broken. Run.output_path for now is still referring to the
overall group's output path, not the individual Run's.
Fix broken tests with _Run+Runner refactor. Runner.run() now properly
awaits the returned _Run._run() coroutine. _Run now saves the run
config and doesn't throw error when output_path is None. Several unit
tests switched back from _Run to Runner, because they're testing
payload configuration which is set at the Runner level.
Change `_Run.output_path` to refer to the specific run's output
folder (i.e. no need to apply run name as suffix), while
`Runner.output_path` continues to refer to the overall run group's
output folder.
Automatic cost modelling for SageMaker Real-Time endpoints, including a
mention in the example notebook.
Fix experiments for recent change of Runner.run() accepting payload
only as a keyword argument.
Rename integ test_sagemaker.py to avoid cache issues
Update before_run base classes, docstrings, and examples to reflect
that it now receives the individual run's _RunConfig (actually the
_Run object) - not a "Runner" which is notably different.
Add tests that _Run calls all expected callback hooks. It looks like
before_invoke is actually not implemented yet!
Comment on lines 25 to 33
```python
async def before_invoke(self, payload: dict) -> None:
    """Lifecycle hook called before every `Endpoint.invoke()` request in a Run.

    Args:
        payload: The payload to be sent to the endpoint.

    Returns:
        None: If you'd like to modify the request `payload`, edit the dictionary in-place.
    """
    pass
```
Contributor (Author):

It looks like this particular hook is not actually implemented yet: see the recently added failing runner.py tests test_run_callback_triggered and test_run_callbacks_triggered_in_order.

I think the most natural place would be _Run._invoke_n_no_wait(), but:

  1. That's currently a sync method, which makes this tricky, and
  2. We probably also want to take a (deep?) copy of the payload before passing it through the chain of callbacks - since callbacks could in principle modify the payload, but a test run may loop through the same source payload multiple times (see the sketch below).
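A rough sketch of the copy-then-dispatch idea from point 2 (hypothetical helper names; not the actual _Run internals):

```python
from copy import deepcopy


async def _fire_before_invoke(callbacks: list, payload: dict) -> dict:
    # Deep-copy first, so callbacks can mutate their copy freely without
    # corrupting the shared source payload that later requests will reuse
    payload_copy = deepcopy(payload)
    for callback in callbacks:
        await callback.before_invoke(payload_copy)
    return payload_copy
```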

Contributor:

I've added a non-async implementation of before_invoke(), with a .copy() to protect the payload from side effects. I've added a comment in the code indicating how the current implementation might affect the timing test, and added a TODO for replacing the current sync method with two separate steps: payload pre-processing and endpoint invocation.

Contributor:

Evidently it did not work. I'm commenting out before_invoke(); I think it's better not to release it until the implementation is resolved.

llmeter/callbacks/mlflow.py: outdated review thread (resolved)
Comment on lines +54 to +58
```python
for key, value in stats.items():
    if not isinstance(value, Number):
        raise ValueError(
            f"Value for key {key} must be a number, got {type(value)}"
        )
```
Contributor (Author):

There's a limitation at the moment where (even though it's possible to re-analyze a result with a different CostModel and update the result using save=True), if the analysis changes what cost dimensions exist then there's no way for the cost model to find and delete the old ones.

For now I figure it's not a blocker, but we should probably warn about it somewhere? This restricted update_contributed_stats API is part of the reason for the limitation, but there'd be some extra work needed to fix it even if that were loosened up.

Contributor:

To be able to update or complement arbitrary modifications, we would need a comprehensive lineage/undo history - that sounds like overkill. A possible alternative approach: cost (and any additional custom record-level entries) could be forced to live in a separate file that references records by index. This would allow multiple cost analyses on the same initial data, and they would all coexist.
Thoughts?
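One illustrative shape for that side-file idea (hypothetical helper and file layout; not part of this PR):

```python
import json


def append_cost_records(path: str, cost_model: str, costs: list[tuple[int, dict]]) -> None:
    """Append per-record cost entries to a JSON-Lines side file.

    Each entry references the invocation it annotates by index, so multiple
    cost analyses can coexist alongside the same immutable run results.
    """
    with open(path, "a") as f:
        for index, cost in costs:
            f.write(json.dumps({"index": index, "cost_model": cost_model, **cost}) + "\n")
```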

Comment on lines +548 to +574
```python
Args:
    endpoint (Endpoint | dict | None): The LLM endpoint to be tested. **Must be set** at
        either the Runner or specific Run level.
    output_path (os.PathLike | str | None): The (cloud or local) base folder under which
        run outputs and configurations should be stored. By default, a new `run_name`
        sub-folder will be created under the Runner's `output_path` if set - otherwise
        outputs will not be saved to file.
    tokenizer (Tokenizer | Any | None): Optional tokenizer used to estimate input and
        output token counts for endpoints that don't report exact information.
    clients (int): The number of concurrent clients to use for sending requests.
    n_requests (int | None): The number of LLM invocations to generate *per client*.
    payload (dict | list[dict] | os.PathLike | str | None): The request data to send to the
        endpoint under test. You can provide a single JSON payload (dict), a list of
        payloads (list[dict]), or a path to one or more JSON/JSON-Lines files to be loaded
        by `llmeter.prompt_utils.load_payloads()`. **Must be set** at either the Runner or
        specific Run level.
    run_name (str | None): Name to use for a specific test Run. By default, runs are named
        with the date and time they're requested in format: `%Y%m%d-%H%M`
    run_description (str | None): A natural-language description for the test Run.
    timeout (int | float): The maximum time (in seconds) to wait for each response from the
        endpoint.
    callbacks (list[Callback] | None): Optional callbacks to enable during the test Run. See
        `llmeter.callbacks` for more information.
    disable_per_client_progress_bar (bool): Set `True` to disable per-client progress bars
        from showing during the run.
    disable_clients_progress_bar (bool): Set `True` to disable overall progress bar from
        showing during the run.
```
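For orientation, the documented arguments imply usage along these lines (a hedged sketch: the import path, the async run() signature, and the payload shape are assumptions, not confirmed by this diff):

```python
import asyncio

from llmeter.runner import Runner  # assumed import path


async def main():
    my_endpoint = ...  # placeholder: construct your Endpoint instance here
    runner = Runner(
        endpoint=my_endpoint,
        output_path="results",  # outputs go in a `run_name` sub-folder under here
        clients=4,              # 4 concurrent clients...
        n_requests=25,          # ...each sending 25 requests
        timeout=60,
    )
    # `payload` (like `endpoint`) must be set at the Runner or specific Run
    # level; its shape depends on the endpoint under test:
    result = await runner.run(payload={"prompt": "Hello!"}, run_name="smoke-test")


asyncio.run(main())
```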
Contributor (Author):

I looked at patterns for re-using chunks between docstrings (for example, HF Transformers makes extensive use of decorators to compose docstrings together) - but while these transformations get picked up by doc tools like Sphinx and the built-in help() function, PyLance/VSCode tooltips don't seem able to see them. So I left everything as vanilla for now, which means a bit of duplication and counter-intuitive placement in this file.

Contributor (Author):

I think these are our only tests that depend on an AWS API to run... but moto doesn't provide mocking for the Pricing API, so there's not much alternative for testing the SageMaker pricing auto-discovery.

I created this new tests/integ folder, and I'd suggest moving our existing tests under tests/unit or tests/local or similar, for clear separation - but I didn't do so yet because it would make the diffs painful to review.

Contributor:

After the initial callback release, let's move cost models and tests into a separate branch. I have the impression that complexity can balloon exponentially, and I'd rather keep it from spilling into the core functionality.

@acere acere marked this pull request as ready for review December 18, 2024 21:54
@acere acere merged commit cada484 into main Dec 19, 2024
@acere acere deleted the callbacks branch December 19, 2024 04:36