CMF fluent API + Ray runner [collecting feedback] #68

Draft: wants to merge 4 commits into master

@sergey-serebryakov (Contributor) commented Dec 15, 2022

Introduction

This PR implements one possible version of what a CMF fluent API can look like. It tries to achieve the following goals:

  • Remove some rarely used features from public API (such as typed parameters for pipelines and steps).
  • Automatically create steps if none are present when users call fluent API (e.g., log_dataset).
  • Initialize CMF in different usage contexts, for instance, retrieve initialization parameters from environment variables.
  • Automatically identify artifact association with steps (input/output) in certain usage scenarios.

Example

Assuming a user has developed four functions (fetch, preprocess, train and test), the following example shows how the CMF fluent API can be used:

import cmflib.contrib.fluent as cmf

cmf.set_cmf_parameters(filename='mlmd', graph=False)
for step in (fetch, preprocess, train, test):
    with cmf.start_step(pipeline='my_pipeline', step=step.__name__):
        step()

API methods

Fluent API methods are categorized into three buckets:

  • Set CMF parameters (set_cmf_parameters). These parameters control CMF initialization, and do not include information about pipelines, steps and executions.
  • Start/end steps (start_step and end_step). These methods start a new pipeline step and end the currently active pipeline step, respectively. The start_step method returns an instance of the Step class that can be used as a Python context manager to end the step automatically.
  • Logging methods (log_dataset, log_dataset_with_version, log_model, log_execution_metrics, log_metric and log_validation_output). These methods log input/output artifacts. Where these methods accept an artifact URL, users can provide either a string or a Path object, e.g.:
    ds_path = _workspace / 'iris.pkl'
    with open(ds_path, 'rb') as stream:
        dataset: t.Dict = pickle.load(stream)
    cmf.log_dataset(ds_path, 'input')
    All these methods create a new step if one does not already exist.

Ray runner

This PR also contains an example of how CMF pipelines can run on Ray clusters. This is possible because the fluent API can initialize CMF from environment variables:

    pipeline_env = {
        'CMF_FLUENT_INIT_METHOD': 'env',
        'CMF_FLUENT_CMF_PARAMS': json.dumps({'filename': mlmd_store.as_posix(), 'graph': False}),
        'CMF_FLUENT_PIPELINE': 'iris',
        'CMF_FLUENT_STEP': None
    }

    for step in (fetch, preprocess, train, test):
        step_env = pipeline_env.copy()
        step_env['CMF_FLUENT_STEP'] = step.remote.__name__
        ref: ray.ObjectRef = step.options(runtime_env={'env_vars': step_env}).remote()
        ray.get(ref)
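
On the worker side, the environment variables set above have to be turned back into CMF parameters, a pipeline name and a step name. The sketch below shows one way this could look. Only the four `CMF_FLUENT_*` variable names come from this PR; the `read_fluent_env` helper, its return shape and the `'default'` pipeline fallback are hypothetical.

```python
import json
import os
from typing import Any, Dict, Mapping, Optional, Tuple


def read_fluent_env(
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Tuple[Dict[str, Any], str, Optional[str]]]:
    """Return (cmf_params, pipeline, step), or None if env-based init is off.

    Hypothetical helper: not part of cmflib.
    """
    env = os.environ if env is None else env
    if env.get('CMF_FLUENT_INIT_METHOD') != 'env':
        return None
    # CMF parameters travel as a JSON-encoded dictionary.
    cmf_params = json.loads(env.get('CMF_FLUENT_CMF_PARAMS', '{}'))
    pipeline = env.get('CMF_FLUENT_PIPELINE', 'default')  # fallback is an assumption
    step = env.get('CMF_FLUENT_STEP') or None
    return cmf_params, pipeline, step
```

Passing the configuration through `runtime_env` keeps the step functions themselves free of any CMF setup code, so the same functions can be used by both the local and the Ray runner.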

This commit implements one possible version of what a CMF fluent API can look like. It tries to achieve the following
goals:

   - Remove some rarely used features from public API (such as typed parameters for pipelines and steps).
   - Automatically create steps if none are present when users call fluent API (e.g., `log_dataset`).
   - Initialize CMF in different usage contexts, for instance, retrieve initialization parameters from environment
     variables.
   - Automatically identify artifact association with steps (consumed/produced) in certain usage scenarios.

## Example

Assuming a user has developed four functions (`fetch`, `preprocess`, `train` and `test`), the following example shows
how the CMF fluent API can be used:

```python
import cmflib.contrib.fluent as cmf

cmf.set_cmf_parameters(filename='mlmd', graph=False)
for step in (fetch, preprocess, train, test):
    with cmf.start_step(pipeline='my_pipeline', step=step.__name__):
        step()
```

## API methods

Methods can be categorized into three buckets:
- Set CMF parameters (`set_cmf_parameters`). These parameters control CMF initialization, and do not include information
  about pipelines, steps and executions.
- Start/end steps (`start_step` and `end_step`). These methods start a new pipeline step and end the currently active
  pipeline step, respectively. The `start_step` method returns an instance of the `Step` class that can be used as a
  Python context manager to end the step automatically.
- Logging methods (`log_dataset`, `log_dataset_with_version`, `log_model`, `log_execution_metrics`, `log_metric` and
  `log_validation_output`). These methods log input/output artifacts. Where these methods accept an artifact URL, users
  can provide a file object instead (e.g., the one returned by the `builtins.open` function). In this case,
  the association (input/output) is identified automatically, e.g.:
  ```python
  with open(_workspace / 'iris.pkl', 'rb') as stream:
      dataset: t.Dict = pickle.load(stream)
      cmf.log_dataset(stream)
  ```
  All these methods create a new step if one does not already exist.
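
One plausible way to identify the association automatically is to look at the mode the file was opened with: a read-only stream suggests an input artifact, a writable one an output artifact. The sketch below illustrates that idea only. `infer_association` is a hypothetical helper and the mode-based rule is an assumption, not necessarily what this commit implements.

```python
import os
from typing import IO, Any, Tuple


def infer_association(stream: IO[Any]) -> Tuple[str, str]:
    """Return (path, 'input' or 'output') for a file opened with builtins.open."""
    name = getattr(stream, 'name', None)
    if not isinstance(name, (str, os.PathLike)):
        # In-memory streams and raw file descriptors have no usable path.
        raise ValueError('stream is not backed by a named file')
    mode = getattr(stream, 'mode', '')
    # Any write-capable mode ('w', 'a', 'x' or '+') is treated as an output.
    writable = any(flag in mode for flag in 'wax+')
    return os.fspath(name), 'output' if writable else 'input'
```

Note that a file opened with `'r+'` would be classified as an output here even if it is only read, which is one of the ambiguities of relying on open file objects.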
@sergey-serebryakov self-assigned this Dec 15, 2022
@sergey-serebryakov added the "enhancement" (New feature or request) label Dec 15, 2022
- Disabling support for file objects in logging methods (happens to be a bad idea to commit open files).
- Splitting example into a pipeline definition and pipeline runners.
- Adding an example that shows how to run CMF pipelines with the fluent API on a Ray cluster.
@sergey-serebryakov changed the title from "CMF fluent API [WIP - collecting feedback]" to "CMF fluent API + Ray runner [WIP - collecting feedback]" Dec 19, 2022
… (local / ray)

```shell
python pipeline.py -e local
python pipeline.py -e ray
```
@sergey-serebryakov changed the title from "CMF fluent API + Ray runner [WIP - collecting feedback]" to "CMF fluent API + Ray runner [collecting feedback]" Jun 27, 2023
@sergey-serebryakov marked this pull request as draft June 27, 2023 21:46
Labels
enhancement New feature or request
1 participant