
Workflow SDK Release 2.3.2+zg2.0

@cloudw cloudw released this 15 Sep 23:26
· 341 commits to feature/kfp since this release
6afc8f9

The Workflow SDK 2.3.2+zg2.0 release is a major release.

Release Summary

Features

Breaking Change - Enforcing Guaranteed Quality of Service for Pods in KFP plugin
Pods whose limits are far larger than their requests have been a problem for cluster stability. In extreme cases, the total burstable resource limits on a host can be 50 times what is actually available. To resolve this issue, we are working to enforce Guaranteed QoS across the board.

In the Workflow SDK, cpu_limit, memory_limit, and local_storage_limit have been removed from the @resources decorator. Users can now provide only a single value each for cpu, memory, and local_storage; the request and the limit are both set to that value.
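The effect of this rule can be sketched as follows. This is a minimal illustration, not the SDK's actual code: the helper name guaranteed_resources is hypothetical, and mapping local_storage to the Kubernetes "ephemeral-storage" resource is an assumption.

```python
from typing import Dict, Optional, Union

Quantity = Union[int, float, str]


def guaranteed_resources(
    cpu: Optional[Quantity] = None,
    memory: Optional[Quantity] = None,
    local_storage: Optional[Quantity] = None,
) -> Dict[str, Dict[str, str]]:
    """Hypothetical helper: build a Kubernetes-style resources dict
    in which every request equals its limit (Guaranteed QoS)."""
    values = {
        "cpu": cpu,
        "memory": memory,
        # Assumption: local_storage corresponds to ephemeral-storage.
        "ephemeral-storage": local_storage,
    }
    # Keep only the resources the user actually specified.
    requests = {name: str(v) for name, v in values.items() if v is not None}
    # Limits are a copy of requests, so the two can never diverge.
    return {"requests": dict(requests), "limits": dict(requests)}


spec = guaranteed_resources(cpu=2, memory="8G")
print(spec)
```

Because there is only one value per resource, a limit can no longer be set higher than its request.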

In the Spark integration (the Spark-related code change is not in this repo):

  • If the user provides limits.cpu:
    • If requests.cpu is also provided, limits.cpu and requests.cpu MUST have the same value, or a ValueError is raised.
    • If requests.cpu is NOT provided, limits.cpu is also used as requests.cpu.
  • If the user does not provide limits.cpu:
    • If requests.cpu is provided, limits.cpu is set to requests.cpu.
    • Otherwise, if the user provides "spark.executor.cores", that value is used for both limits.cpu and requests.cpu.
    • Otherwise, limits.cpu and requests.cpu are both set to the default value of "spark.executor.cores", which is 1.
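
The resolution rules above can be sketched as a single function. Since the Spark-related change lives outside this repo, this is only an illustration of the decision order: the function name resolve_executor_cpu, its signature, and the plain string comparison of CPU quantities are all assumptions.

```python
from typing import Mapping, Optional, Tuple

DEFAULT_EXECUTOR_CORES = "1"  # default "spark.executor.cores"


def resolve_executor_cpu(
    limits_cpu: Optional[str],
    requests_cpu: Optional[str],
    spark_conf: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Hypothetical helper: return (requests_cpu, limits_cpu), always equal."""
    conf = spark_conf or {}
    if limits_cpu is not None:
        # Simplification: quantities are compared as strings, so "1" and
        # "1000m" would be treated as different values here.
        if requests_cpu is not None and requests_cpu != limits_cpu:
            raise ValueError(
                f"requests.cpu ({requests_cpu}) must equal limits.cpu ({limits_cpu})"
            )
        return limits_cpu, limits_cpu
    if requests_cpu is not None:
        return requests_cpu, requests_cpu
    # Neither value given: fall back to spark.executor.cores or its default.
    cores = conf.get("spark.executor.cores", DEFAULT_EXECUTOR_CORES)
    return cores, cores


print(resolve_executor_cpu(None, None, {"spark.executor.cores": "4"}))
```

Whichever branch is taken, the request and the limit come out identical, which is what Guaranteed QoS requires.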

Stream logging for KFP plugin
This feature is pulled from upstream Metaflow version 2.2.10 and adapted for the KFP plugin. Notable changes:

  • Logs are now published to the datastore periodically by a sidecar process. Previously, logs for the KFP plugin became available in the datastore only after the step finished.
  • You may access logs using python flow.py logs <run-id>/<step-name>
    • For retried steps, only logs from the last retry are printed. Logs from all attempts remain available in the datastore.

Allow sharing attached volume across split nodes
When @resources(volume_mode="ReadWriteMany", volume=<desired amount>) is specified, the attached volume is shared across the split nodes of the same step.

Default pod labels for more detailed ZGCP costs ledger (#90, #92, #94)
By default, pods are now labeled with their experiment, flow, and step name for more detailed cost tracking.

Changes from upstream

Here is a partial list of changes from upstream that are applicable to ZG AI Platform.
For the full change list, please see the release notes from 2.2.5 to 2.3.2.

Features

Bug Fix