Releases: zillow/metaflow
zg-2.3
What's Changed
- AIP-8434 resilient_flow.py fix by @talebzeghmi in #297
- increase test parallelization 3->7 by @talebzeghmi in #298
- AIP-8440 retry opsgenie test intermittent error by @talebzeghmi in #299
- AIP-8457 remove unnecessary kfp dependencies by @talebzeghmi in #300
- AIP-8570 remove KFP use of requests_toolbelt by @talebzeghmi in #301
- AIP-8684 PEP440 compliant version by @talebzeghmi in #304
- Adding ability to run latest argo workflow template by @cloudw in #303
- [minor][bugfix] Remove redundant generateName by @cloudw in #305
- AIP-8704 add deploy default behavior by @cavandervoort in #307
- Bumping the minor version to 2.3 by @cloudw in #308
Full Changelog: zg-2.2...zg-2.3
zg-2.2
What's Changed
- AIP-8098 ArgoUI render multiple card artifacts by @talebzeghmi in #287
- AIP-8124 ExitHandler doesn't notify on Workflow Error by @talebzeghmi in #289
- AIP-8095 perm argo-ui link w/ uid by @talebzeghmi in #291
- AIP-8056 remove ReadWriteMany volume by @talebzeghmi in #292
- AIP-8163 Remove KFP annotations & labels from Argo YAML by @talebzeghmi in #293
- AIP-8176 PVC & Sensor retry count 7->3 by @talebzeghmi in #294
- AIP-8179 metaflow_version tag and card fix by @talebzeghmi in #295
Full Changelog: zg-2.1...zg-2.2
zg-2.1
What's Changed
- AIP-7299 kfp run to use Argo submit by @talebzeghmi in #236
- AIP-7314 Argo Flow Trigger Flow by @talebzeghmi in #238
- AIP-7339 Recurring Workflow by @talebzeghmi in #240
- AIP-7339 parameterize name of CronWorkflow and ConfigMap by @talebzeghmi in #251
- Re-disable foreach integration tests by @talebzeghmi in #256
- AIP-7418 Remove KFP cluster create by @talebzeghmi in #257
- AIP-7784 validate MF tags w/ k8s label rules by @talebzeghmi in #265
- AIP-7802 SemVer: WFSDK v2 for kfp-argo by @talebzeghmi in #266
- AIP-7796 Embed zillow-kfp compiler into WFSDK by @talebzeghmi in #267
- Remove Github Metaflow full suite of tests by @talebzeghmi in #270
- AIP-7837 feature/kfp-argo -> feature/aip by @talebzeghmi in #269
--- This would have been the 2.0 release ---
- AIP-7772 re-disable nested foreach tests by @talebzeghmi in #271
- AIP-7888 relax urllib3 & pyyaml pins by @talebzeghmi in #272
- AIP-7511 exit_handler DAG - Part 1 by @talebzeghmi in #273
- AIP-7511 user defined @exit_handler Part 2/2 by @talebzeghmi in #274
- AIP-7487 join memory fix by @talebzeghmi in #276
- Add batch component label to Workflow resource by @tmckay in #275
- AIP-7985 minutes between retries bug by @talebzeghmi in #277
- ref CICD hash commit before WORKING_DIR by @talebzeghmi in #279
- AIP-8051 Fix AIP Integration tests by @talebzeghmi in #281
- AIP-8056 disable volume_mode="ReadWriteMany" test by @talebzeghmi in #282
- sanitize_k8s_name workflow template name by @talebzeghmi in #283
- AIP-8072 set retry "2m" not 2 by @talebzeghmi in #284
- AIP-8089 use unique {{workflow.uid}} by @talebzeghmi in #285
Full Changelog: 1.3.2409+2.5.4...zg-2.1
zg-1.3
Tagging the last version of 1.3 before the major feature/aip version update to 2.0.
What's Changed
- Adding sys_tag as parameters by @cloudw in #175
- Bugfix: Colon can be used in tags now. by @cloudw in #176
- Bugfix: sys tag not work properly when used alone. by @cloudw in #177
- Publish now depends on cicd var by @cloudw in #178
- Support shared memory in KFP plug-in by @cloudw in #184
- AIP-5874 update failure flow assertion by @aaron-arellano in #191
- Moving .gitlab-ci.yml to root dir by @cloudw in #196
- AIP-6493 zodiac_owner label & tag by @talebzeghmi in #199
- k8s labels zodiac owner w/o "@" alias only by @talebzeghmi in #203
- AIP-6601 ResilientFlow fix to be more resilient by @talebzeghmi in #204
- AIP-6595: Add in Sandbox stage for Metaflow integration tests by @alexlatchford in #205
- USER=$GITLAB_USER_EMAIL fix by @talebzeghmi in #206
- AIP-66002 s3_sensor mem & k8s platform error by @talebzeghmi in #207
- AIP-6602 s3_sensor Retry policy="Always" & reduce mem by @talebzeghmi in #208
- Tz/AIP-6604-accelerator-type-none by @talebzeghmi in #209
- AIP-6642-s3-sensor-test-retry by @talebzeghmi in #210
- AIP-6522 IAM role per workflow by @aaron-arellano in #197
- AIP-6522: IAM role per workflow by @aaron-arellano in #214
- AIP-6643 kfp wait to argo wait by @talebzeghmi in #216
- AIP-6643 DEPLOY_INTERNAL: "true" by @talebzeghmi in #217
- Bump version to 1.3.x by @cloudw in #212
- Deprecate KFP preceding component by @cloudw in #211
- AIP-6514 local storage deprecation by @cloudw in #198
- Release/1.3 by @cloudw in #219
- AIP-6717 disable foreach tests by @talebzeghmi in #220
- AIP-6339: Zodiac service per nb by @aaron-arellano in #218
- AIP-6693 Create PVC just before the step by @talebzeghmi in #221
- AIP-6788 Optional PVC volume_type by @talebzeghmi in #223
- AIP-6753 @interruptible decorator by @talebzeghmi in #222
- AIP-6884 kfp-pod-default label on all KFP steps by @talebzeghmi in #225
- Update label to node.k8s.zgtools.net/capacity-type by @tmckay in #227
- Tz/AIP-6887-annotate-all-pods by @talebzeghmi in #228
- AIP-6950 Checkpoint SDK by @talebzeghmi in #229
- current.task_log_location() & checkpoint os.listdir fix by @talebzeghmi in #230
- Pin min zillow-kfp version to speed up version resolution by @cloudw in #231
- AIP-7181 log code path by @talebzeghmi in #233
- AIP-7275 Save capacity-type, host, instance-type to MF metadata by @talebzeghmi in #234
- Add yaml output format for Argo Workflow and WorkflowTemplate by @cloudw in #235
- Argo: Remove service account; Use static workflow name by @cloudw in #239
- Remove copy of .kube because of ci-cd-template changes by @tmckay in #244
- AIP-7497 set retry on exit_handler by @talebzeghmi in #246
- send messages to SQS exit-handler step in workflow if configured by @xiaowei-zillow in #247
- sqs_message_json DLQ param fix by @xiaowei-zillow in #248
- AIP-7702 AIP-7714 PVC retry and @retry(minutes_between_retries) by @talebzeghmi in #255
- AIP-7738 retry_backoff_factor feature by @talebzeghmi in #264
New Contributors
- @aaron-arellano made their first contribution in #191
- @tmckay made their first contribution in #227
- @xiaowei-zillow made their first contribution in #247
Full Changelog: zg-1.2...1.3.2409+2.5.4
zg-1.2
To use this version you need build 1.2.1418+2.5.4 or above.
Main Changes
- Upstream Merge
- Pulling in all features until Metaflow 2.5.4
- Notably, support for @card is added to visualize results. See related docs for more details.
- Features
- Compatibility Fix
- Fix compatibility issue with argo 1.5.0 (#174)
What's Changed
- AIP-5324 JSON parameters by @talebzeghmi in #149
- Update black run to use python 3.9 by @cloudw in #152
- Switching to using Github container registry for base image by @hsezhiyan in #153
- Flow triggering flow by @cloudw in #150
- Automate flow triggering flow test by @cloudw in #156
- Add zillow-kfp and kfp-server-api to default image by @cloudw in #157
- Merge 2.5.4 fixes by @cloudw in #165
- Merge 2 5 4 show conflicts by @cloudw in #168
- AIP-6005 Use PVC as tmp path in S3() by @talebzeghmi in #169
- Raise value error on tags that are too long by @cloudw in #142
- Use common image config in KFP by @cloudw in #172
- Add back unit test coverage by @cloudw in #171
- Rename KFP_CONTAINER_IMAGE to KFP_DEFAULT_CONTAINER_IMAGE by @cloudw in #174
- Fix env var name for argo wf name by @cloudw in #173
- Add service / k8s tags as system tags by @cloudw in #170
Full Changelog: zg-1.0...zg-1.2
zg-1.0
Main Changes
- Bugfixes for @s3_sensor, PLEG stability and node utilization issues (using high-memory toleration)
- Support for pytest coverage
- Switch to ZG version schema, so that our internal breaking changes are reflected in version number
What's Changed
- AIP-4600: Refactor Gitlab CI pipeline to include publishing lib to Artifactory by @alexlatchford in #115
- @kfp(image=) to support customers who want to specify image per step by @hsezhiyan in #120
- AIP-4600: Relax pylint version by @alexlatchford in #125
- AIP-5183 - Fixing regression in @s3_sensor by @hsezhiyan in #126
- Pod toleration based on CPU and memory by @cloudw in #129
- AIP-5103: Swap over feature branches to use dev releases versioning scheme by @alexlatchford in #128
- AIP-5103: Move to leverage the aip-py-cpu base image and remedy Python build errors by @alexlatchford in #132
- AIP-5283 - Fix @s3_sensor usage with @resources(volume=...) and --notify by @hsezhiyan in #131
- Support pytest coverage of customer Flows by @talebzeghmi in #127
- METAFLOW_COVERAGE_OMIT check for None by @talebzeghmi in #135
- AIP-5330 set default retry policy="Always" (even on PodDeletion) by @talebzeghmi in #134
- Handle None value of COVERAGE_OMIT by @cloudw in #140
- AIP-5068 - Reduce PLEG Stability Issues by @hsezhiyan in #137
- Use "purpose: high-memory" toleration instead of "instance-type: r5.12xlarge" by @cloudw in #143
- AIP-5333 - @s3_sensor resilient to failures by @hsezhiyan in #146
Full Changelog: 2.3.2+zg2.0...zg-1.0
Workflow SDK Release 2.3.2+zg2.0
The Workflow SDK 2.3.2+zg2.0 release is a major release.
Release Summary
- Upstream Merge:
  - Pulling in all features until Metaflow 2.3.2
- Features
  - (Breaking Change) Enforcing Guaranteed Quality of Service for Pods in KFP plugin (#118)
  - Stream logging enabled for KFP plugin
  - Allow attached volume shared across split nodes
  - Default pod labels for more detailed ZGCP costs ledger (#90 #92 #94)
Features
Breaking Change - Enforcing Guaranteed Quality of Service for Pods in KFP plugin
Pods whose limits are far larger than their requests have been a problem for cluster stability. In extreme cases, the total burstable resource limits on a host may be 50 times what is available. To resolve this issue, we are enforcing Guaranteed QoS across the board.
In the Workflow SDK, cpu_limit, memory_limit, and local_storage_limit have been removed from the @resources decorator. Users can only provide single values for cpu, memory, or local_storage, and both requests and limits will be set to the same value.
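As a rough illustration of what "requests and limits set to the same value" means, the sketch below builds a Kubernetes-style resources mapping from single values. The helper name `guaranteed_resources` and the use of the `ephemeral-storage` key for local storage are assumptions made for this example; this is not the plugin's actual code.

```python
def guaranteed_resources(cpu, memory, local_storage=None):
    """Build a resources mapping whose requests equal its limits.

    Hypothetical helper for illustration only. Kubernetes assigns the
    Guaranteed QoS class when CPU and memory requests equal their limits
    for every container in the pod.
    """
    values = {"cpu": str(cpu), "memory": str(memory)}
    if local_storage is not None:
        # Assumption: local storage maps to the ephemeral-storage resource.
        values["ephemeral-storage"] = str(local_storage)
    # Copy so requests and limits are separate dicts with equal content.
    return {"requests": dict(values), "limits": dict(values)}


resources = guaranteed_resources(cpu=2, memory="8G")
```

Because there is only one value per resource, a pod built this way cannot burst beyond what the scheduler reserved for it.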
In Spark integration (Spark-related code change not in this repo):
- If the user provides limits.cpu
  - If requests.cpu is also provided, limits.cpu and requests.cpu MUST have the same value or ValueError will be raised
  - If requests.cpu is NOT provided, limits.cpu will be used as requests.cpu as well
- If the user does not provide limits.cpu
  - If requests.cpu is provided, limits.cpu = requests.cpu
  - Else if the user provides "spark.executor.cores", the value will be used for both limits.cpu and requests.cpu
  - Else limits.cpu and requests.cpu will be set to the default "spark.executor.cores", which is 1
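The Spark CPU rules above can be sketched as a small resolution function. The Spark integration lives outside this repo, so the function name `resolve_cpu`, its parameters, and the error message are all hypothetical; only the decision order follows the list above.

```python
DEFAULT_EXECUTOR_CORES = 1  # default "spark.executor.cores" per the rules above


def resolve_cpu(limits_cpu=None, requests_cpu=None, executor_cores=None):
    """Return (limits_cpu, requests_cpu) following the rules listed above.

    Hypothetical sketch; not the actual Spark integration code.
    """
    if limits_cpu is not None:
        # Both provided: they must agree, otherwise the configuration is rejected.
        if requests_cpu is not None and requests_cpu != limits_cpu:
            raise ValueError(
                "limits.cpu (%s) and requests.cpu (%s) must be equal"
                % (limits_cpu, requests_cpu)
            )
        return limits_cpu, limits_cpu
    if requests_cpu is not None:
        # Only requests provided: limits follow requests.
        return requests_cpu, requests_cpu
    if executor_cores is not None:
        # Neither provided: fall back to the user's spark.executor.cores.
        return executor_cores, executor_cores
    return DEFAULT_EXECUTOR_CORES, DEFAULT_EXECUTOR_CORES
```

Every branch returns the same value twice, which is what keeps the resulting pods in the Guaranteed QoS class.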
Stream logging for KFP plug-in
This is a feature pulled from upstream Metaflow version 2.2.10 and adapted for KFP plugin. Several changes:
- Logs are periodically published to the datastore by a sidecar process. Previously, logs for the KFP plugin were available in the datastore only after the step finished.
- You may access logs using python flow.py logs <run-id>/<step-name>
- For retried steps, only logs from the last retry will be printed. All logs are available in the datastore.
Allow sharing attached volume across split nodes
By specifying @resources(volume_mode="ReadWriteMany", volume=<desired amount>), the attached volume will be shared across split nodes of the same step.
Default pod labels for more detailed ZGCP costs ledger(#90 #92 #94)
By default, pods are now labeled with their experiment, flow, and step names for more detailed cost tracking.
Changes from upstream
Here is a partial list of changes from upstream that are applicable to ZG AI Platform.
For the full change list please see release notes from 2.2.5 to 2.3.2
Features
- Performance optimizations for merge_artifacts
- Execution logs are now available for all tasks in Metaflow universe
Bug Fix
- Handle regression with Click >=8.0.x
- Fix regression with ping/ endpoint for Metadata service
- Fix the behaviour of --namespace= CLI args when executing a flow
- Remove pinned pylint dependency
- List custom FlowSpec parameters in the intended order
- Fix @environment behavior for conflicting attribute values
- Fix environment is not callable error when using @environment
- Pipe all workflow set-up logs to stderr
- Handle null assignment to IncludeFile properly
2.2.5+zg1.1
Workflow SDK Release 2.2.5+zg1.1
Release Summary:
- Support Persistent Volume Claim (PVC) in @resources decorator #81
  - Please use if your step needs any disk space
- PyTorchDistributedDecorator is deprecated #88
- Support P3 GPU instance #83
- Improve Zodiac integration and cost tracking - automatic pod labeling for zodiac_service, zodiac_team #86
- Improve Datadog integration - automatic labeling for flow name, experiment name, run id, and step name #80
- Metadata reporting fix in CICD #86
Support for Persistent Volume Claim (PVC)
To use disk space, you can now specify a persistent volume in the @resources decorator per step. It is as simple as:
@resources(volume="30G")
@step
def my_task(self):
    ...
By default, the volume is mounted at /opt/metaflow_volume and is available only to the decorated step. If @retry is used, the volume is shared across retries of this step. This is useful if you want to pick up from previous progress; otherwise, be sure to clean up.
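Since the volume persists across retries, a step can record its progress there and resume on the next attempt. The following is a minimal sketch of that pattern using only the standard library. The file name `progress.json`, the helper names, and the state layout are assumptions for illustration and are not part of the SDK.

```python
import json
import os

PROGRESS_FILE = "progress.json"  # hypothetical file name inside the volume


def load_progress(volume_dir):
    """Return progress saved by an earlier attempt, or a fresh state."""
    path = os.path.join(volume_dir, PROGRESS_FILE)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0}


def save_progress(volume_dir, state):
    """Persist progress so that a retry of the step can resume from it."""
    with open(os.path.join(volume_dir, PROGRESS_FILE), "w") as f:
        json.dump(state, f)


def clear_progress(volume_dir):
    """Remove leftover state when a retry should start from scratch."""
    path = os.path.join(volume_dir, PROGRESS_FILE)
    if os.path.exists(path):
        os.remove(path)
```

Inside a step, `volume_dir` would be the mount path (for example /opt/metaflow_volume by default). Call `clear_progress` at the start of the step if stale state from a failed attempt must not be reused.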
You can also customize the PVC mount path or make the volume available to all subsequent steps. Two additional attributes, volume_dir and volume_mode, are needed:
@resources(volume="30G", volume_mode="ReadWriteMany", volume_dir=<your_preferred_path>)
@step
def my_task(self):
    ...
Refer to doc string here for more details.
PyTorchDistributedDecorator (@pytorch_distributed) is deprecated due to implementation similarity.
P3 GPU Instance Support
We are adding an option for P3 instances when a more powerful GPU is handy - introducing the @accelerator decorator!
@accelerator sets the taints and node label for your steps. To request a P3 instance:
@accelerator(type="nvidia-tesla-v100")
@resources(...)
@step
def my_task(self):
    ...
While other instances can be requested similarly in the future, additional work is needed to support each type. Please let us (aip teams) know if other unsupported instance types suit your use cases better.
Improve Zodiac integration and cost tracking
Services are now automatically tagged with zodiac_service and zodiac_team. As a result, cost will be tracked in each team's Zodiac page based on namespace profile settings. Be sure to update your team's Kubeflow profile to take advantage of this feature.
Improve Datadog integration
Flow name, experiment name, run id, and step name are automatically added to K8s pod labels.
Stay tuned for dashboard filters using these attributes.
Metadata reporting fix in CICD
Fix a bug where metadata was determined at compile time and therefore did not correctly reflect the run-time environment when uploading to the Metaflow service.