Auto Cache Plugin #2971

Open: wants to merge 17 commits into base: master

dansola (Contributor) commented Dec 2, 2024

Why are the changes needed?

Make caching easier to use in flytekit by reducing the cognitive burden of specifying cache versions.

What changes were proposed in this pull request?

To use the caching mechanism in a Flyte task, you can define a CachePolicy that combines multiple caching strategies. Here’s an example of how to set it up:

from flytekit import task
from flytekit.core.auto_cache import CachePolicy
from flytekitplugins.auto_cache import CacheFunctionBody, CachePrivateModules

cache_policy = CachePolicy(
    auto_cache_policies=[
        CacheFunctionBody(),
        CachePrivateModules(root_dir="../my_package"),
        ...,
    ],
    salt="my_salt",
)

@task(cache=cache_policy)
def task_fn():
    ...

Salt Parameter

The salt parameter of CachePolicy is mixed into the generated hash. Changing the salt changes the hash even when the underlying code is unchanged, so it can be used to distinguish versions of the same task or to invalidate the cache for a specific version on demand.
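
The effect of the salt can be illustrated with a minimal sketch. It assumes the per-policy hashes are simply concatenated with the salt and hashed again; `combine_versions` and the placeholder component strings are hypothetical and are not the plugin's actual implementation.

```python
import hashlib


def combine_versions(component_hashes, salt=""):
    """Fold per-policy hashes and a salt into one version string (illustrative only)."""
    digest = hashlib.sha256()
    for component in component_hashes:
        digest.update(component.encode("utf-8"))
    # The salt is mixed in last, so changing it changes the final digest.
    digest.update(salt.encode("utf-8"))
    return digest.hexdigest()


# Placeholder values standing in for the output of individual cache policies.
components = ["hash-of-function-body", "hash-of-private-modules"]

v1 = combine_versions(components, salt="my_salt")
v2 = combine_versions(components, salt="my_other_salt")
print(v1 != v2)  # same code, different salt, different version
```

With identical components, only the salt differs between `v1` and `v2`, which is enough to produce a new cache version.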

Cache Implementations

Users can add any number of cache policies that implement the AutoCache protocol defined in auto_cache.py. The implementations available so far are:

1. CacheFunctionBody

This implementation hashes the contents of the function of interest, ignoring any formatting or comment changes. It ensures that the core logic of the function is considered for versioning.
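
One way to get a hash that ignores formatting and comments is to hash the parsed syntax tree rather than the raw text. The sketch below assumes that approach; `hash_function_source` is a hypothetical helper, and the plugin may normalize the source differently.

```python
import ast
import hashlib


def hash_function_source(source: str) -> str:
    """Hash the AST of the source so whitespace and comments do not matter."""
    tree = ast.parse(source)
    # ast.dump omits line/column attributes by default, so layout is ignored.
    normalized = ast.dump(tree)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


original = "def add(a, b):\n    return a + b\n"
reformatted = "def add(a,b):\n    # add two numbers\n\n    return a+b\n"
changed = "def add(a, b):\n    return a - b\n"

print(hash_function_source(original) == hash_function_source(reformatted))
print(hash_function_source(original) == hash_function_source(changed))
```

Note that in this sketch a docstring is part of the syntax tree, so editing a docstring would change the hash even though editing a comment would not.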

2. CacheImage

This implementation includes the hash of the container_image object passed. If the image is specified as a name, that string is hashed. If it is an ImageSpec, the parametrization of the ImageSpec is hashed, allowing for precise versioning of the container image used in the task.
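
The two cases (image name versus ImageSpec) can be sketched as follows. `ExampleImageSpec` is a hypothetical stand-in dataclass, not flytekit's ImageSpec, and serializing its fields to sorted JSON is an assumption about what "hashing the parametrization" could look like.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List, Union


@dataclass
class ExampleImageSpec:
    """Hypothetical stand-in for an image specification (not flytekit's ImageSpec)."""
    name: str
    base_image: str
    packages: List[str] = field(default_factory=list)


def hash_image(image: Union[str, ExampleImageSpec]) -> str:
    """Hash an image name directly, or hash the fields of a spec object."""
    if isinstance(image, str):
        payload = image
    else:
        # Sorted keys keep the serialization stable across runs.
        payload = json.dumps(asdict(image), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


spec_a = ExampleImageSpec(name="train", base_image="python:3.11", packages=["pandas"])
spec_b = ExampleImageSpec(name="train", base_image="python:3.11", packages=["pandas", "numpy"])

print(hash_image("my-registry/train:v1") != hash_image("my-registry/train:v2"))
print(hash_image(spec_a) != hash_image(spec_b))
```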

3. CachePrivateModules

This implementation recursively searches the task of interest for all callables and constants used. The contents of any callable (function or class) utilized by the task are hashed, ignoring formatting or comments. The values of the literal constants used are also included in the hash.

It accounts for both import and from-import statements at the global and local levels within a module or function. Any callables that are within site-packages (i.e., external libraries) are ignored.
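
The import discovery step can be sketched with the standard `ast` module. This is a simplified illustration: `collect_imports` is a hypothetical helper, `my_package` is a placeholder name, and private modules are selected here by package-name prefix rather than by checking for site-packages as the plugin does.

```python
import ast


def collect_imports(source: str) -> set:
    """Return module names from import and from-import statements at any nesting level."""
    modules = set()
    # ast.walk visits every node, so imports inside functions are found too.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return modules


example = '''
import my_package.utils
from my_package.constants import THRESHOLD

def task_fn():
    from my_package import helpers  # local import inside the function
    import numpy as np
    return helpers.run(THRESHOLD)
'''

found = collect_imports(example)
# Keep only modules that belong to the private package (assumed name).
private = {name for name in found if name.split(".")[0] == "my_package"}
print(sorted(private))
```

The source is only parsed, never executed, so none of the imported modules need to be installed for this step.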

4. CacheExternalDependencies

This implementation recursively searches through all callables, as CachePrivateModules does, but when it finds an external package it records the package version and includes that version in the hash. This ensures that changes in external dependencies are reflected in the task's versioning.
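
Recording package versions can be sketched with the standard library's `importlib.metadata`. The helper names and the `"not-installed"` fallback are assumptions for illustration; the plugin's own lookup and error handling may differ.

```python
import hashlib
from importlib import metadata


def dependency_versions(package_names):
    """Map each distribution name to its installed version, if any."""
    versions = {}
    for name in sorted(package_names):
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Assumed fallback so the sketch stays deterministic.
            versions[name] = "not-installed"
    return versions


def hash_dependencies(package_names):
    """Hash the name==version pairs so a version bump changes the result."""
    versions = dependency_versions(package_names)
    payload = ",".join(f"{name}=={version}" for name, version in versions.items())
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


print(dependency_versions(["pip", "definitely-not-a-real-package-xyz"]))
```

Sorting the names first makes the hash independent of the order in which dependencies were discovered.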

How was this patch tested?

Unit tests for the following:

  • verify that a function hash changes only when the function contents change, not when formatting changes or comments are added
  • verify that a dummy repository can be recursively searched when various import statements are used
  • verify that functions not used by the task of interest are not hashed
  • verify that all constants used by a task, and by any of the functions it calls, are identified
  • verify that in a new Python environment, the correct external libraries are identified
  • verify that the correct dependency versions can be identified

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

@@ -132,9 +133,9 @@ def task(

@overload
def task(
-    _task_function: Callable[P, FuncOut],
+    _task_function: Callable[..., FuncOut],
Contributor

why change P to ...?

Comment on lines +65 to +67
self.cache_serialize = cache_serialize
self.cache_version = cache_version
self.cache_ignore_input_vars = cache_ignore_input_vars
Contributor

what's the purpose of saving this state here? aren't these just forwarded to the underlying TaskMetadata?

@@ -95,7 +96,7 @@ def find_pythontask_plugin(cls, plugin_config_type: type) -> Type[PythonFunction
def task(
_task_function: None = ...,
task_config: Optional[T] = ...,
-    cache: bool = ...,
+    cache: Union[bool, CachePolicy] = ...,
Contributor

Can we make this accept any AutoCache-compliant object?

Basically, the user can provide just a single AutoCache object like CacheFunctionBody or compose multiple into a CachePolicy, but users should not be forced to always use a CachePolicy object.
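
A rough sketch of what accepting any compliant object could look like, using a structural `Protocol`. Every name here (`SupportsCacheVersion`, `ExampleFunctionBodyPolicy`, `ExampleCompositePolicy`, `resolve_cache`) and the zero-argument `get_version` signature are hypothetical; the real AutoCache protocol in this PR may look different.

```python
from typing import Optional, Protocol, Tuple, Union, runtime_checkable


@runtime_checkable
class SupportsCacheVersion(Protocol):
    """Hypothetical protocol: anything that can produce a cache version."""

    def get_version(self) -> str: ...


class ExampleFunctionBodyPolicy:
    """Stand-in for a single policy such as CacheFunctionBody."""

    def get_version(self) -> str:
        return "function-body-hash"


class ExampleCompositePolicy:
    """Stand-in for a policy that composes several others plus a salt."""

    def __init__(self, *policies: SupportsCacheVersion, salt: str = "") -> None:
        self.policies = policies
        self.salt = salt

    def get_version(self) -> str:
        return "|".join(p.get_version() for p in self.policies) + self.salt


def resolve_cache(cache: Union[bool, SupportsCacheVersion]) -> Tuple[bool, Optional[str]]:
    """Normalize the cache argument into (enabled, version)."""
    if isinstance(cache, bool):
        return cache, None
    if isinstance(cache, SupportsCacheVersion):
        # A single policy and a composite are handled identically.
        return True, cache.get_version()
    raise TypeError(f"unsupported cache argument: {type(cache).__name__}")


print(resolve_cache(ExampleFunctionBodyPolicy()))
print(resolve_cache(ExampleCompositePolicy(ExampleFunctionBodyPolicy(), salt="-v2")))
```

Because the check is structural, a composite is just another object with `get_version`, so callers never need to wrap a single policy.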

Comment on lines +350 to +357
    cache_version_val = cache_version or cache.get_version(params=params)
    cache_serialize_val = cache_serialize or cache.cache_serialize
    cache_ignore_input_vars_val = cache_ignore_input_vars or cache.cache_ignore_input_vars
else:
    cache_val = cache
    cache_version_val = cache_version
    cache_serialize_val = cache_serialize
    cache_ignore_input_vars_val = cache_ignore_input_vars
Contributor

what's the purpose of forwarding all of these parameters via the CachePolicy object? It doesn't look like it's being modified there.

Comment on lines +20 to +27
cache_policy = CachePolicy(
    auto_cache_policies=[
        CacheFunctionBody(),
        CachePrivateModules(root_dir="../my_package"),
        ...,
    ],
    salt="my_salt",
)
Contributor

let's also provide an example of not needing to provide a CachePolicy object, e.g. just passing in a CacheFunctionBody.

2 participants