23 Aug 01:00

cloudw

7942609

zg-2.3 Latest

What's Changed

AIP-8434 resilient_flow.py fix by @talebzeghmi in #297
increase test parallelization 3->7 by @talebzeghmi in #298
AIP-8440 retry opsgenie test intermittent error by @talebzeghmi in #299
AIP-8457 remove unnecessary kfp dependencies by @talebzeghmi in #300
AIP-8570 remove KFP use of requests_toolbelt by @talebzeghmi in #301
AIP-8684 PEP440 compliant version by @talebzeghmi in #304
Adding ability to run latest argo workflow template by @cloudw in #303
[minor][bugfix] Remove redundant generateName by @cloudw in #305
AIP-8704 add deploy default behavior by @cavandervoort in #307
Bumping the minor version to 2.3 by @cloudw in #308

Full Changelog: zg-2.2...zg-2.3

Contributors

talebzeghmi, cloudw, and cavandervoort

22 Mar 22:51

talebzeghmi

zg-2.2

18ad4df

zg-2.2

What's Changed

AIP-8098 ArgoUI render multiple card artifacts by @talebzeghmi in #287
AIP-8124 ExitHandler doesn't notify on Workflow Error by @talebzeghmi in #289
AIP-8095 perm argo-ui link w/ uid by @talebzeghmi in #291
AIP-8056 remove ReadWriteMany volume by @talebzeghmi in #292
AIP-8163 Remove KFP annotations & labels from Argo YAML by @talebzeghmi in #293
AIP-8176 PVC & Sensor retry count 7->3 by @talebzeghmi in #294
AIP-8179 metaflow_version tag and card fix by @talebzeghmi in #295

Full Changelog: zg-2.1...zg-2.2

Contributors

talebzeghmi

15 Feb 18:50

talebzeghmi

zg-2.1

ce30b72

zg-2.1

What's Changed

AIP-7299 kfp run to use Argo submit by @talebzeghmi in #236
AIP-7314 Argo Flow Trigger Flow by @talebzeghmi in #238
AIP-7339 Recurring Workflow by @talebzeghmi in #240
AIP-7339 parameterize name of CronWorkflow and ConfigMap by @talebzeghmi in #251
Re-disable foreach integration tests by @talebzeghmi in #256
AIP-7418 Remove KFP cluster create by @talebzeghmi in #257
AIP-7784 validate MF tags w/ k8s label rules by @talebzeghmi in #265
AIP-7802 SemVer: WFSDK v2 for kfp-argo by @talebzeghmi in #266
AIP-7796 Embed zillow-kfp compiler into WFSDK by @talebzeghmi in #267
Remove Github Metaflow full suite of tests by @talebzeghmi in #270
AIP-7837 feature/kfp-argo -> feature/aip by @talebzeghmi in #269
--- This would have been the 2.0 release ---
AIP-7772 re-disable nested foreach tests by @talebzeghmi in #271
AIP-7888 relax urllib3 & pyyaml pins by @talebzeghmi in #272
AIP-7511 exit_handler DAG - Part 1 by @talebzeghmi in #273
AIP-7511 user defined @exit_handler Part 2/2 by @talebzeghmi in #274
AIP-7487 join memory fix by @talebzeghmi in #276
Add batch component label to Workflow resource by @tmckay in #275
AIP-7985 minutes between retries bug by @talebzeghmi in #277
ref CICD hash commit before WORKING_DIR by @talebzeghmi in #279
AIP-8051 Fix AIP Integration tests by @talebzeghmi in #281
AIP-8056 disable volume_mode="ReadWriteMany" test by @talebzeghmi in #282
sanitize_k8s_name workflow template name by @talebzeghmi in #283
AIP-8072 set retry "2m" not 2 by @talebzeghmi in #284
AIP-8089 use unique {{workflow.uid}} by @talebzeghmi in #285

Full Changelog: 1.3.2409+2.5.4...zg-2.1

Contributors

tmckay and talebzeghmi

01 Nov 17:55

talebzeghmi

1.3.2409+2.5.4

22c3f21

zg-1.3

Tagging the last version of 1.3 before the 2.0 major version update on feature/aip.

What's Changed

Adding sys_tag as parameters by @cloudw in #175
Bugfix: Colon can be used in tags now. by @cloudw in #176
Bugfix: sys tag not work properly when used alone. by @cloudw in #177
Publish now depends on cicd var by @cloudw in #178
Support shared memory in KFP plug-in by @cloudw in #184
AIP-5874 update failure flow assertion by @aaron-arellano in #191
Moving .gitlab-ci.yml to root dir by @cloudw in #196
AIP-6493 zodiac_owner label & tag by @talebzeghmi in #199
k8s labels zodiac owner w/o "@" alias only by @talebzeghmi in #203
AIP-6601 ResilientFlow fix to be more resilient by @talebzeghmi in #204
AIP-6595: Add in Sandbox stage for Metaflow integration tests by @alexlatchford in #205
USER=$GITLAB_USER_EMAIL fix by @talebzeghmi in #206
AIP-66002 s3_sensor mem & k8s platform error by @talebzeghmi in #207
AIP-6602 s3_sensor Retry policy="Always" & reduce mem by @talebzeghmi in #208
Tz/AIP-6604-accelerator-type-none by @talebzeghmi in #209
AIP-6642-s3-sensor-test-retry by @talebzeghmi in #210
AIP-6522 IAM role per workflow by @aaron-arellano in #197
AIP-6522: IAM role per workflow by @aaron-arellano in #214
AIP-6643 kfp wait to argo wait by @talebzeghmi in #216
AIP-6643 DEPLOY_INTERNAL: "true" by @talebzeghmi in #217
Bump version to 1.3.x by @cloudw in #212
Deprecate KFP preceding component by @cloudw in #211
AIP-6514 local storage deprecation by @cloudw in #198
Release/1.3 by @cloudw in #219
AIP-6717 disable foreach tests by @talebzeghmi in #220
AIP-6339: Zodiac service per nb by @aaron-arellano in #218
AIP-6693 Create PVC just before the step by @talebzeghmi in #221
AIP-6788 Optional PVC volume_type by @talebzeghmi in #223
AIP-6753 @interruptible decorator by @talebzeghmi in #222
AIP-6884 kfp-pod-default label on all KFP steps by @talebzeghmi in #225
Update label to node.k8s.zgtools.net/capacity-type by @tmckay in #227
Tz/AIP-6887-annotate-all-pods by @talebzeghmi in #228
AIP-6950 Checkpoint SDK by @talebzeghmi in #229
current.task_log_location() & checkpoint os.listdir fix by @talebzeghmi in #230
Pin min zillow-kfp version to speed up version resolution by @cloudw in #231
AIP-7181 log code path by @talebzeghmi in #233
AIP-7275 Save capacity-type, host, instance-type to MF metadata by @talebzeghmi in #234
Add yaml output format for Argo Workflow and WorkflowTemplate by @cloudw in #235
Argo: Remove service account; Use static workflow name by @cloudw in #239
Remove copy of .kube because of ci-cd-template changes by @tmckay in #244
AIP-7497 set retry on exit_handler by @talebzeghmi in #246
send messages to SQS exit-handler step in workflow if configured by @xiaowei-zillow in #247
sqs_message_json DLQ param fix by @xiaowei-zillow in #248
AIP-7702 AIP-7714 PVC retry and @Retry(minutes_between_retries) by @talebzeghmi in #255
AIP-7738 retry_backoff_factor feature by @talebzeghmi in #264

New Contributors

@aaron-arellano made their first contribution in #191
@tmckay made their first contribution in #227
@xiaowei-zillow made their first contribution in #247

Full Changelog: zg-1.2...1.3.2409+2.5.4

Contributors

tmckay, alexlatchford, and 5 other contributors

29 Apr 17:43

cloudw

zg-1.2

6a1ffd0

zg-1.2

To use this version you need build 1.2.1418+2.5.4 or above.

Main Changes

Upstream Merge
- Pulling in all features until Metaflow 2.5.4
- Notably, support for @card is added to visualize results. See related docs for more details.
Features
- Flow can now trigger downstream pipelines uploaded to KFP (#150)
- metaflow.S3 tmproot defaults to PVC (#169)
Compatibility Fix
- Fix compatibility issue with argo 1.5.0 (#174)

What's Changed

AIP-5324 JSON parameters by @talebzeghmi in #149
Update black run to use python 3.9 by @cloudw in #152
Switching to using Github container registry for base image by @hsezhiyan in #153
Flow triggering flow by @cloudw in #150
Automate flow triggering flow test by @cloudw in #156
Add zillow-kfp and kfp-server-api to default image by @cloudw in #157
Merge 2.5.4 fixes by @cloudw in #165
Merge 2 5 4 show conflicts by @cloudw in #168
AIP-6005 Use PVC as tmp path in S3() by @talebzeghmi in #169
Raise value error on tags that are too long by @cloudw in #142
Use common image config in KFP by @cloudw in #172
Add back unit test coverage by @cloudw in #171
Rename KFP_CONTAINER_IMAGE to KFP_DEFAULT_CONTAINER_IMAGE by @cloudw in #174
Fix env var name for argo wf name by @cloudw in #173
Add service / k8s tags as system tags by @cloudw in #170

Full Changelog: zg-1.0...zg-1.2

Contributors

talebzeghmi, cloudw, and 2 other contributors

29 Apr 17:29

cloudw

zg-1.0

a42dc9e

zg-1.0

Main Changes

Bugfixes for @s3_sensor, PLEG Stability and node utilization issues (using high-memory toleration)
Support for pytest coverage
Switch to the ZG versioning scheme, so that our internal breaking changes are reflected in the version number

What's Changed

AIP-4600: Refactor Gitlab CI pipeline to include publishing lib to Artifactory by @alexlatchford in #115
@kfp(image=) to support customers who want to specify image per step by @hsezhiyan in #120
AIP-4600: Relax pylint version by @alexlatchford in #125
AIP-5183 - Fixing regression in @s3_sensor by @hsezhiyan in #126
Pod toleration based on CPU and memory by @cloudw in #129
AIP-5103: Swap over feature branches to use dev releases versioning scheme by @alexlatchford in #128
AIP-5103: Move to leverage the aip-py-cpu base image and remedy Python build errors by @alexlatchford in #132
AIP-5283 - Fix @s3_sensor usage with @resources(volume=...) and --notify by @hsezhiyan in #131
Support pytest coverage of customer Flows by @talebzeghmi in #127
METAFLOW_COVERAGE_OMIT check for None by @talebzeghmi in #135
AIP-5330 set default retry policy="Always" (even on PodDeletion) by @talebzeghmi in #134
Handle None value of COVERAGE_OMIT by @cloudw in #140
AIP-5068 - Reduce PLEG Stability Issues by @hsezhiyan in #137
Use "purpose: high-memory" toleration instead of "instance-type: r5.12xlarge" by @cloudw in #143
AIP-5333 - @s3_sensor resilient to failures by @hsezhiyan in #146

Full Changelog: 2.3.2+zg2.0...zg-1.0

Contributors

alexlatchford, kfp, and 3 other contributors

15 Sep 23:26

cloudw

2.3.2+zg2.0

6afc8f9

Workflow SDK Release 2.3.2+zg2.0

The Workflow SDK 2.3.2+zg2.0 release is a major release.

Release Summary

Upstream Merge:
- Pulling in all features until Metaflow 2.3.2
Features
- (Breaking Change) Enforcing Guaranteed Quality of Service for Pods in KFP plugin (#118)
- Stream logging enabled for KFP plugin
- Allow attached volume shared across split nodes
- Default pod labels for more detailed ZGCP costs ledger (#90 #92 #94)

Features

Breaking Change - Enforcing Guaranteed Quality of Service for Pods in KFP plugin
Pods whose limits are far larger than their requests have been a problem for cluster stability. In extreme cases a host's total burstable resource limits can be 50 times what is actually available. To resolve this, we are enforcing Guaranteed QoS across the board.

In the Workflow SDK, cpu_limit, memory_limit, and local_storage_limit have been removed from the @resources decorator. Users can only provide single values for cpu, memory, or local_storage, and both requests and limits are set to the same value.
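
As a rough illustration of what "requests and limits set to the same value" means at the Kubernetes level, the sketch below builds a container resources mapping from single values. The function name `guaranteed_resources` is hypothetical, and mapping `local_storage` to the Kubernetes `ephemeral-storage` resource is an assumption; this is not the SDK's actual implementation.

```python
from typing import Dict, Optional


def guaranteed_resources(
    cpu: str, memory: str, local_storage: Optional[str] = None
) -> Dict[str, Dict[str, str]]:
    """Build a container resources mapping with requests equal to limits.

    Hypothetical helper for illustration only; the SDK's internals may differ.
    """
    values = {"cpu": cpu, "memory": memory}
    if local_storage is not None:
        # Assumption: local storage corresponds to Kubernetes ephemeral-storage.
        values["ephemeral-storage"] = local_storage
    # Copy the values so requests and limits are separate but identical dicts.
    return {"requests": dict(values), "limits": dict(values)}


spec = guaranteed_resources(cpu="2", memory="8G", local_storage="20G")
print(spec["requests"] == spec["limits"])
```

Kubernetes assigns the Guaranteed QoS class only when every container in the pod has CPU and memory requests equal to its limits, which is why a single value per resource is enough.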

In Spark integration (spark related code change not in this repo):

If the user provides limits.cpu:
- If requests.cpu is also provided, limits.cpu and requests.cpu MUST have the same value, or a ValueError is raised.
- If requests.cpu is NOT provided, limits.cpu is also used as requests.cpu.
If the user does not provide limits.cpu:
- If requests.cpu is provided, limits.cpu = requests.cpu.
- Otherwise, if the user provides "spark.executor.cores", that value is used for both limits.cpu and requests.cpu.
- Otherwise, limits.cpu and requests.cpu are both set to the default "spark.executor.cores", which is 1.
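
The resolution rules above can be sketched as a small function. Since the Spark-side code is not in this repo, everything here is an assumption made for illustration: the function name `resolve_executor_cpu`, its parameters, and the use of strings for CPU values are all hypothetical.

```python
from typing import Optional, Tuple

DEFAULT_EXECUTOR_CORES = "1"  # default "spark.executor.cores" per the notes above


def resolve_executor_cpu(
    limits_cpu: Optional[str] = None,
    requests_cpu: Optional[str] = None,
    executor_cores: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (requests_cpu, limits_cpu), always equal, following the listed rules."""
    if limits_cpu is not None:
        if requests_cpu is not None and requests_cpu != limits_cpu:
            raise ValueError(
                f"limits.cpu ({limits_cpu}) and requests.cpu ({requests_cpu}) must match"
            )
        # requests.cpu is either absent or identical, so limits.cpu is used for both.
        return limits_cpu, limits_cpu
    if requests_cpu is not None:
        return requests_cpu, requests_cpu
    if executor_cores is not None:
        return executor_cores, executor_cores
    return DEFAULT_EXECUTOR_CORES, DEFAULT_EXECUTOR_CORES


print(resolve_executor_cpu(requests_cpu="4"))
print(resolve_executor_cpu(executor_cores="2"))
```

Every branch returns the same value for requests and limits, which is what keeps the executor pods in the Guaranteed QoS class.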

Stream logging for KFP plug-in
This feature is pulled from upstream Metaflow 2.2.10 and adapted for the KFP plugin. Notable changes:

Logs are now published to the datastore periodically by a sidecar process. Previously, KFP plugin logs became available in the datastore only after the step finished.
You may access logs using python flow.py logs <run-id>/<step-name>
- For retried steps, only logs from the last retry are printed. All logs are available in the datastore.

Allow sharing attached volume across split nodes
By specifying @resources(volume_mode="ReadWriteMany", volume=<desired amount>), the attached volume is shared across split nodes of the same step.

Default pod labels for more detailed ZGCP costs ledger (#90 #92 #94)
By default, pods are now labeled with their experiment, flow, and step names for more detailed cost tracking.
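
To illustrate what such labeling involves, here is a minimal sketch that builds a label mapping and sanitizes values to satisfy the Kubernetes label-value rules (at most 63 characters, alphanumerics plus `-`, `_`, `.`, starting and ending with an alphanumeric). The label keys and function names are hypothetical; the actual keys applied by the SDK are not stated in these notes.

```python
import re
from typing import Dict


def sanitize_label_value(value: str) -> str:
    """Coerce a string into a valid Kubernetes label value."""
    # Replace disallowed characters, then enforce the 63-character limit.
    cleaned = re.sub(r"[^A-Za-z0-9_.-]", "-", value)[:63]
    # Label values must start and end with an alphanumeric character.
    return cleaned.strip("-_.")


def build_pod_labels(experiment: str, flow: str, step: str) -> Dict[str, str]:
    """Hypothetical label keys, chosen only for this example."""
    return {
        "experiment": sanitize_label_value(experiment),
        "flow_name": sanitize_label_value(flow),
        "step_name": sanitize_label_value(step),
    }


print(build_pod_labels("pricing model v2", "TrainFlow", "start"))
```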

Changes from upstream

Here is a partial list of changes from upstream that are applicable to ZG AI Platform.
For the full change list, please see the upstream release notes from 2.2.5 to 2.3.2.

Features

Performance optimizations for merge_artifacts
Execution logs are now available for all tasks in Metaflow universe

Bug Fix

Handle regression with Click >=8.0.x
Fix regression with ping/ endpoint for Metadata service
Fix the behaviour of --namespace= CLI args when executing a flow
Remove pinned pylint dependency
List custom FlowSpec parameters in the intended order
Fix @environment behavior for conflicting attribute values
Fix environment is not callable error when using @environment
Pipe all workflow set-up logs to stderr
Handle null assignment to IncludeFile properly

14 Jun 17:50

talebzeghmi

2.2.5+zg1.1

6101ef0

2.2.5+zg1.1

Workflow SDK Release `2.2.5+zg1.1`

Release Summary:

Support Persistent Volume Claim (PVC) in @resources decorator #81
- Please use if your step needs any disk space
- PyTorchDistributedDecorator is deprecated #88
Support P3 GPU instance #83
Improve Zodiac integration and cost tracking - automatic pod labeling for zodiac_service, zodiac_team #86
Improve Datadog integration - automatic labeling for flow name, experiment name, run id, and step name #80
Metadata reporting fix in CICD #86

Support for Persistent Volume Claim (PVC)

To use disk space, you can now specify a persistent volume in the @resources decorator per step. It is as simple as

@resources(volume="30G")
@step
def my_task(self):
    ...

By default the volume is mounted at /opt/metaflow_volume and is available only to the decorated step. If @retry is used, the volume is shared across retries of this step. This is useful if you want to pick up from previous progress; otherwise, be sure to clean up leftover files.
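
Because the volume persists across retries, a step can record its progress there and resume after a failure. The sketch below shows one way to do that with a small JSON marker file; the helper names and the file name `progress.json` are hypothetical, and the directory is passed in as a parameter rather than hard-coded so the example runs anywhere.

```python
import json
from pathlib import Path


def load_progress(volume_dir: str, name: str = "progress.json") -> int:
    """Return the index of the next item to process, or 0 on a fresh volume."""
    marker = Path(volume_dir) / name
    if marker.exists():
        return json.loads(marker.read_text())["next_item"]
    return 0


def save_progress(volume_dir: str, next_item: int, name: str = "progress.json") -> None:
    """Record how far the step got so a retry can pick up from here."""
    (Path(volume_dir) / name).write_text(json.dumps({"next_item": next_item}))


def clear_progress(volume_dir: str, name: str = "progress.json") -> None:
    """Remove the marker when resuming is not wanted."""
    marker = Path(volume_dir) / name
    if marker.exists():
        marker.unlink()
```

Inside a step, one would call `load_progress` with the mount path at the start, call `save_progress` after each unit of work, and call `clear_progress` first if stale state from an earlier attempt should be ignored.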

You can also customize the PVC mount path or make the volume available to all subsequent steps. This requires two additional attributes, volume_dir and volume_mode:

@resources(volume="30G", volume_mode="ReadWriteMany", volume_dir=<your_preferred_path>)
@step
def my_task(self):
    ...

Refer to doc string here for more details.

PyTorchDistributedDecorator (@pytorch_distributed) is deprecated due to implementation similarity.

P3 GPU Instance Support

We are adding an option to run on P3 instances when a more powerful GPU is needed: introducing the @accelerator decorator!

@accelerator sets the taints and node label for your steps. To request a P3 instance:

@accelerator(type="nvidia-tesla-v100")
@resources(...)
@step
def my_task(self):
    ...

While other instances can be requested similarly in the future, additional work is needed to support each type. Please let us (aip teams) know if other unsupported instance types suit your use cases better.

Releases: zillow/metaflow