[Tracking] Callbacks, Cost Modelling, and MLFlow integration #10
Conversation
- Introduce new callback module with base and cost submodules
- Add MLflow integration callback
- Added nested option for MlflowCallback
- Revised typing for base Callback class
- Added callback mechanism to `LoadTest`
- Refactored callback mechanism to `Runner` and `run()`
- Added run start and end time to the run stats
Add `path` argument to base `Callback.save_to_file`. Make base `Callback.load_from_file` final, to indicate that overriding it won't do users any good. Take an initial stab at docstrings for the class.
Also fix a broken test from the last commit (whoops).
Add token granularity feature to CostPer*Token dimensions, and update their `rate`->`rate_per_million` to better align with common orders of magnitude
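To illustrate the `rate_per_million` convention (the helper below is hypothetical, not LLMeter's API):

```python
# Hypothetical helper showing the per-million-token rate convention:
def token_cost(n_tokens: int, rate_per_million: float) -> float:
    return n_tokens * rate_per_million / 1_000_000

# e.g. 1,500 input tokens at $3.00 per million tokens:
token_cost(1_500, 3.00)  # -> 0.0045
```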
CostModel was not `await`ing the results of cost dimension `calculate` coroutines, yielding unusable outputs.
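For context, the shape of this bug in miniature (illustrative code, not LLMeter's):

```python
import asyncio

async def calculate() -> float:
    """Stand-in for a cost dimension's `calculate` coroutine."""
    return 0.0045

async def main():
    broken = calculate()       # Bug: a coroutine object, not a number
    broken.close()             # (close it to avoid a "never awaited" warning)
    fixed = await calculate()  # Fix: awaiting yields the usable 0.0045
    print(fixed)

asyncio.run(main())
```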
Reverse the direction of the partial-import circular-dependency resolution: in general, the callbacks module now depends on the rest of LLMeter, and `Runner`'s annotation of `Callback` types is the exception. This way around should greatly reduce the number of places we have to put `if TYPE_CHECKING` switches.
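For reference, the exception follows the standard deferred-import pattern (the module contents here are a simplified sketch):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Type-checking-only import: the runtime cycle
    # (callbacks -> core -> callbacks) never executes.
    from llmeter.callbacks.base import Callback

class Runner:
    def __init__(self, callbacks: list[Callback] | None = None):
        self.callbacks = callbacks or []
```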
Flatten cost results onto Result & InvocationResponse classes so they can be picked up by stats. Refactor cost model dimensions from list to a map keyed by name (which needed a small adjustment to serde `to_dict_recursive_generic` to accommodate dicts)
- Add cost calculation dimensions and modeling
- Integrate cost tracking with base endpoint
- Update runner and results to handle cost metrics
- Add test coverage for cost dimensions and model
Cost Callback now handles generating result-level summary stats for response-level cost elements
Update cost callback model and results implementation Modify core results functionality in llmeter package to enable callbacks to contribute custom stats
MLflow Callback changes:
- Added automatic naming for nested runs

Results processing:
- Added new private method `_update_contributed_stats()` to be used by callbacks to contribute stats to `result.stats`

Dependencies:
- Added mlflow as an extra dependency
Push Result.stats-style formatting down into `CalculatedCostWithDimensions.summary_statistics` so the extra layer of transformation in CostModel can be removed. Also extend its `.{load_from|save_to}_namespace()` to support dicts, to further simplify stats preparation logic in CostModel.
Update CostModel to use `Result._update_contributed_stats` API instead of manipulating the private `_contributed_stats` dict.
Replace previous ternary `include_request_costs` option in CostModel `calculate_run_cost()` with a boolean `recalculate_request_costs`. The case of *excluding* request-level costs from the run-level output is no longer supported - as we expect it to be niche and potentially confusing.
Duplicate dimension name checking in CostModel constructor was not properly detecting duplicate dimension names within one dimension type (run vs request).
Replace placeholders in MlflowCallback and Callback.load_from_file with NotImplementedErrors specifying that they're still TODO. Hoist `Callback` to `callbacks.__init__` level. Misc docstring and license header updates
Start to split out Runner (for a group of runs) from Run (for a single specific run). Includes some updates for tests, but many are still broken. Run.output_path for now still refers to the overall group's output path, not the individual Run's.
Fix broken tests with _Run+Runner refactor. Runner.run() now properly awaits the returned _Run._run() coroutine. _Run now saves the run config and doesn't throw error when output_path is None. Several unit tests switched back from _Run to Runner, because they're testing payload configuration which is set at the Runner level.
Change `_Run.output_path` to refer to the specific run's output folder (i.e. no need to apply run name as suffix), while `Runner.output_path` continues to refer to the overall run group's output folder.
Automatic cost modelling for SageMaker Real-Time endpoints, including a mention in the example notebook.
Fix experiments for the recent change of `Runner.run()` accepting `payload` only as a keyword argument.
Rename integ test_sagemaker.py to avoid cache issues
Update before_run base classes, docstrings, and examples to reflect that it now receives the individual run's _RunConfig (actually the _Run object) - not a "Runner", which is notably different.
Add tests that _Run calls all expected callback hooks. It looks like before_invoke is actually not implemented yet!
llmeter/callbacks/base.py
```python
async def before_invoke(self, payload: dict) -> None:
    """Lifecycle hook called before every `Endpoint.invoke()` request in a Run.

    Args:
        payload: The payload to be sent to the endpoint.

    Returns:
        None: If you'd like to modify the request `payload`, edit the dictionary in-place.
    """
    pass
```
It looks like this particular hook is not actually implemented yet: see the recently added failing `runner.py` tests `test_run_callback_triggered` and `test_run_callbacks_triggered_in_order`.

I think the most natural place would be `_Run._invoke_n_no_wait()`, but:
- That's currently a sync method, which makes this tricky, and
- We probably also want to take a (deep?) copy of the payload before passing it through the chain of callbacks - since callbacks could in principle modify the payload, but a test run may loop through the same source payload multiple times. (See the sketch below.)
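Roughly like this hypothetical helper (names and wiring are illustrative, not the actual `_Run` code):

```python
import copy

async def _fire_before_invoke(callbacks, payload: dict) -> dict:
    # Deep-copy so callbacks can edit their view of the payload in-place
    # without mutating the shared source payload that later requests reuse:
    payload = copy.deepcopy(payload)
    for callback in callbacks:
        await callback.before_invoke(payload)
    return payload
```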
I've added a non-async implementation of `before_invoke()`, with a `.copy()` to protect the payload from side effects. I've added a comment in the code indicating how the current implementation might affect the timing test, and added a TODO for replacing the current sync method with two separate steps: payload pre-processing and endpoint invocation.
Evidently it did not work. I'm commenting out `before_invoke()` - I think it's better not to release it until the implementation is resolved.
```python
for key, value in stats.items():
    if not isinstance(value, Number):
        raise ValueError(
            f"Value for key {key} must be a number, got {type(value)}"
        )
```
There's a limitation at the moment where (even though it's possible to re-analyze a result with a different `CostModel` and update the result using `save=True`), if the analysis changes what cost dimensions exist then there's no way for the cost model to find and delete the old ones.

For now I figure it's not a blocker, but we should probably warn about it somewhere? This restricted `update_contributed_stats` API is part of the reason for the limitation, but there'd be some extra work needed to fix it even if this was loosened up.
To be able to update/complement arbitrary modifications we would need a comprehensive lineage/undo history, which sounds like overkill. A possible alternative approach: costs (and any additional custom record-level entries) could be forced to live in a separate file that references records by index. This would allow multiple cost analyses to coexist on the same initial data.

Thoughts?
```
Args:
    endpoint (Endpoint | dict | None): The LLM endpoint to be tested. **Must be set** at
        either the Runner or specific Run level.
    output_path (os.PathLike | str | None): The (cloud or local) base folder under which
        run outputs and configurations should be stored. By default, a new `run_name`
        sub-folder will be created under the Runner's `output_path` if set - otherwise
        outputs will not be saved to file.
    tokenizer (Tokenizer | Any | None): Optional tokenizer used to estimate input and
        output token counts for endpoints that don't report exact information.
    clients (int): The number of concurrent clients to use for sending requests.
    n_requests (int | None): The number of LLM invocations to generate *per client*.
    payload (dict | list[dict] | os.PathLike | str | None): The request data to send to the
        endpoint under test. You can provide a single JSON payload (dict), a list of
        payloads (list[dict]), or a path to one or more JSON/JSON-Lines files to be loaded
        by `llmeter.prompt_utils.load_payloads()`. **Must be set** at either the Runner or
        specific Run level.
    run_name (str | None): Name to use for a specific test Run. By default, runs are named
        with the date and time they're requested in format: `%Y%m%d-%H%M`
    run_description (str | None): A natural-language description for the test Run.
    timeout (int | float): The maximum time (in seconds) to wait for each response from the
        endpoint.
    callbacks (list[Callback] | None): Optional callbacks to enable during the test Run. See
        `llmeter.callbacks` for more information.
    disable_per_client_progress_bar (bool): Set `True` to disable per-client progress bars
        from showing during the run.
    disable_clients_progress_bar (bool): Set `True` to disable the overall progress bar from
        showing during the run.
```
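For reference, a hedged usage sketch based on the arguments documented above (`my_endpoint` is assumed to be an already-constructed LLMeter endpoint, and the payload shape is endpoint-specific and illustrative only):

```python
from llmeter.runner import Runner  # import path assumed from this PR's runner.py references

runner = Runner(
    endpoint=my_endpoint,   # assumed: an already-constructed Endpoint
    output_path="outputs",  # outputs land under outputs/<run_name>/
    clients=4,              # 4 concurrent clients...
    n_requests=10,          # ...each sending 10 requests
    timeout=60,
)
result = await runner.run(payload={"prompt": "Hello!"})  # keyword-only; run in an async context
```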
I looked at patterns for re-using chunks between docstrings (for example, HF Transformers makes extensive use of decorators to compose docstrings together) - but while these transformations get picked up by doc tools like Sphinx and the built-in `help()` function, PyLance/VSCode tooltips don't seem able to see them :( So I left everything as vanilla for now, which means a bit of duplication & counter-intuitive placement in this file.
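For reference, the kind of runtime docstring composition that was ruled out (a minimal sketch, loosely in the style of HF Transformers' docstring decorators):

```python
SHARED_ARGS_DOC = """
Args:
    clients (int): The number of concurrent clients to use for sending requests.
"""

def append_docstring(extra: str):
    def decorator(fn):
        # Visible to help() and Sphinx, but set at runtime -
        # static tooltips (PyLance/VSCode) can't see it:
        fn.__doc__ = (fn.__doc__ or "") + extra
        return fn
    return decorator

@append_docstring(SHARED_ARGS_DOC)
def run():
    """Run a load test."""
```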
I think these are our only tests that depend on an AWS API to run... But moto doesn't provide mocking for the Pricing API, so there's not much alternative for testing the SageMaker pricing auto-discovery.

Created this new `tests/integ` folder, and I'd probably suggest moving our existing ones under `tests/unit` or `tests/local` or something, for clear separation? But I didn't do so yet because it'd be painful for reviewing the diffs.
After the initial callback release, let's move cost models and tests into a separate branch. I have the impression that complexity can balloon exponentially, and I'd rather keep it from spilling into the core functionality.
Issue #, if available: closes #8, closes #9
Description of changes:
This PR introduces a general callbacks mechanism to LLMeter, and implements two new features using it: cost modelling and MLFlow experiment tracking.

A callback (similar to HF Transformers) is a class that may implement any of the available lifecycle hooks, which LLMeter will call during a test Run.
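For example, a minimal custom callback might look like this (hook names are from this PR; the hook arguments' exact types are assumptions for illustration):

```python
from llmeter.callbacks import Callback

class RunAnnouncer(Callback):
    async def before_run(self, run) -> None:
        # Called once before the test Run starts:
        print("Starting run")

    async def after_run(self, run) -> None:
        # Called once after the test Run completes:
        print("Run complete")
```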
`MlflowCallback` uses only the `after_run()` hook to log experiment details to the active MLFlow environment, while `CostModel`s also use `before_run()` and request-level hooks to estimate request-level and run-level costs. Since pricing is generally complicated for a wide variety of reasons (including volume discounts, private pricing, capacity reservations, etc.), our cost model solution prioritizes giving users flexible tools to define their own models of request- and run-level costs - with space for providing automatic, end-to-end estimates in future (`llmeter.callbacks.cost.providers`), but caution on implementing them.
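For instance, a user-defined request-level cost dimension might look roughly like this (a sketch: the async `calculate` coroutine matches the commit notes above, but the surrounding interface is an assumption):

```python
class FlatFeePerRequest:
    """Illustrative request-level cost dimension: a fixed fee per invocation."""

    def __init__(self, fee: float):
        self.fee = fee

    async def calculate(self, response) -> float:
        # `response` is assumed to be the InvocationResponse for one request:
        return self.fee
```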
BREAKING CHANGES:
`payload` (and all other arguments) of `Runner.run()` must now be passed as keyword arguments (i.e. `payload=...`). This is to avoid potential reliance on argument sequencing, because `run()` can override all attributes of `Runner` (a large number) and there's nothing intrinsically special about `payload` coming first in this context.
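An illustrative before/after (assuming `runner` and `payload` already exist):

```python
# runner.run(payload)            # old positional style - now raises TypeError
await runner.run(payload=payload)  # required keyword style
```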
Docs & tests checklist
Exclusions for future work
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.