Workflow SDK Release 2.3.2+zg2.0
The Workflow SDK 2.3.2+zg2.0 release is a major release.
Release Summary
- Upstream Merge:
  - Pulling in all features up to Metaflow 2.3.2
- Features:
  - (Breaking Change) Enforcing Guaranteed Quality of Service for Pods in KFP plugin (#118)
  - Stream logging enabled for KFP plugin
  - Allow attached volume to be shared across split nodes
  - Default pod labels for more detailed ZGCP costs ledger (#90, #92, #94)
Features
Breaking Change - Enforcing Guaranteed Quality of Service for Pods in KFP plugin
Pods whose limits are far larger than their requests have been a problem for cluster stability. In extreme cases, hosts may have total burstable resource limits 50 times larger than what is actually available. To resolve this issue, we now enforce Guaranteed QoS across the board.
In the Workflow SDK, cpu_limit, memory_limit, and local_storage_limit have been removed from the @resources decorator. Users can only provide single values for cpu, memory, and local_storage, and both requests and limits will be set to the same value.
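For example, a step would now declare its resources like this (a minimal sketch; the memory and local_storage values follow the usual Metaflow convention of megabytes, which is an assumption about your configuration):

```python
from metaflow import FlowSpec, step, resources

class TrainFlow(FlowSpec):

    # Single values only: the pod's requests and limits are both set to
    # these values, which yields Guaranteed QoS. The old cpu_limit,
    # memory_limit, and local_storage_limit arguments are no longer accepted.
    @resources(cpu=2, memory=8192, local_storage=10240)
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainFlow()
```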
In the Spark integration (the Spark-related code change is not in this repo), cpu values are resolved as follows (see the sketch after this list):
- If the user provides limits.cpu:
  - If requests.cpu is also provided, limits.cpu and requests.cpu MUST have the same value, or a ValueError will be raised.
  - If requests.cpu is NOT provided, limits.cpu will be used as requests.cpu as well.
- If the user does not provide limits.cpu:
  - If requests.cpu is provided, limits.cpu = requests.cpu.
  - Otherwise, if the user provides "spark.executor.cores", that value will be used for both limits.cpu and requests.cpu.
  - Otherwise, limits.cpu and requests.cpu will be set to the default "spark.executor.cores" value, which is 1.
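A rough Python sketch of this resolution logic, assuming values arrive as plain strings or numbers (the function name and signature are hypothetical; the real implementation lives in the Spark integration, outside this repo):

```python
DEFAULT_EXECUTOR_CORES = "1"  # Spark's default for spark.executor.cores

def resolve_executor_cpu(limits_cpu=None, requests_cpu=None, executor_cores=None):
    # Case 1: limits.cpu is provided.
    if limits_cpu is not None:
        if requests_cpu is not None and requests_cpu != limits_cpu:
            raise ValueError("limits.cpu and requests.cpu must have the same value")
        requests_cpu = limits_cpu
    # Case 2: limits.cpu is not provided.
    elif requests_cpu is not None:
        limits_cpu = requests_cpu
    elif executor_cores is not None:
        limits_cpu = requests_cpu = executor_cores
    else:
        limits_cpu = requests_cpu = DEFAULT_EXECUTOR_CORES
    return limits_cpu, requests_cpu
```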
Stream logging for KFP plugin
This is a feature pulled from upstream Metaflow version 2.2.10 and adapted for the KFP plugin. Several changes:
- Logs are published to the datastore periodically via a sidecar process. Previously, for the KFP plugin, logs were available in the datastore only after the step finished.
- You may access logs using python flow.py logs <run-id>/<step-name>
- For retried steps, only the logs from the last retry will be printed; all logs remain available in the datastore.
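For example, assuming a flow defined in flow.py and a hypothetical run id of 1234:

```
python flow.py logs 1234/start
```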
Allow sharing attached volume across split nodes
By specifying @resources(volume_mode="ReadWriteMany", volume=<desired amount>), the attached volume will be shared across the split nodes of the same step.
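A minimal sketch of a foreach flow using this (the volume value and its units here are illustrative assumptions; <desired amount> is whatever your cluster expects):

```python
from metaflow import FlowSpec, step, resources

class ShardedFlow(FlowSpec):

    @step
    def start(self):
        self.shards = list(range(4))
        self.next(self.process, foreach="shards")

    # Every split node of this step mounts the same attached volume
    # because volume_mode is ReadWriteMany. The volume size is illustrative.
    @resources(volume=1024, volume_mode="ReadWriteMany")
    @step
    def process(self):
        self.shard = self.input
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ShardedFlow()
```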
Default pod labels for more detailed ZGCP costs ledger (#90, #92, #94)
By default, pods are now labeled with their experiment, flow, and step names for more detailed cost tracking.
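This also makes it possible to select a flow's pods by label. The label keys below are illustrative assumptions, not confirmed key names; check the pods in your cluster for the exact keys:

```
kubectl get pods -l flow=TrainFlow,step=start
```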
Changes from upstream
Here is a partial list of changes from upstream that are applicable to the ZG AI Platform. For the full change list, please see the release notes from 2.2.5 to 2.3.2.
Features
- Performance optimizations for merge_artifacts
- Execution logs are now available for all tasks in the Metaflow universe
Bug Fixes
- Handle regression with Click >=8.0.x
- Fix regression with ping/ endpoint for Metadata service
- Fix the behaviour of --namespace= CLI args when executing a flow
- Remove pinned pylint dependency
- List custom FlowSpec parameters in the intended order
- Fix @environment behavior for conflicting attribute values
- Fix environment is not callable error when using @environment
- Pipe all workflow set-up logs to stderr
- Handle null assignment to IncludeFile properly