Releases: tony-framework/TonY
Release TonY 0.4.9
What's Changed
- Close all resources when the client is killed by @zuston in #600
- Task failure handling mechanism: missed-heartbeat-failure is consistent with other failures by @zuston in #607
- Pass secret keys from AM to containers to support Hadoop encryption by @helloworld1 in #605
- Make untrackedTaskFailed volatile by @zuston in #608
- Release TonY v0.4.9 by @plliao in #609
Full Changelog: v0.4.8...v0.4.9
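As a rough illustration of why #608 marks the flag `volatile`: a flag written by one thread and read by another needs a visibility guarantee. The sketch below is not TonY's code. Only the field name `untrackedTaskFailed` comes from the changelog; the class and method names are hypothetical.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical class for illustration; only the field name is taken from the changelog.
class UntrackedTaskMonitorSketch {
    // volatile ensures a write made by one thread is visible to subsequent reads in other threads.
    private volatile boolean untrackedTaskFailed = false;

    // Called by the thread that detects the failure.
    void markUntrackedTaskFailed() {
        untrackedTaskFailed = true;
    }

    boolean hasUntrackedTaskFailed() {
        return untrackedTaskFailed;
    }

    // Polls the flag until it is set or the timeout elapses; returns the final flag value.
    boolean awaitFailure(long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!untrackedTaskFailed && System.nanoTime() - deadline < 0) {
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status and stop polling
                break;
            }
        }
        return untrackedTaskFailed;
    }
}
```

Without `volatile`, the Java memory model would not guarantee that the polling thread ever observes the update.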
Release TonY 0.4.8
Changes in this release:
TonyClient creates FileSystem from Path to support fully qualified HDFS paths (#598)
Prevent loss of root cause due to resetting the final state (#599)
Rename tony.worker.timeout to tony.task.executor.execution-timeout-ms (#596)
Set job failed when runtime is not healthy (#597)
Speed up CI tests when the AM crashes (#594)
Fix invalid checkstyle suppressions on Windows (#590)
Remove task from heartbeat monitor when container finishes (#588)
Refactor TensorFlow-related classes to tony (#583)
Remove tensorflow package (#582)
Catch unknown exception when retrieving task metrics (#581)
Make the task executor's maximum number of failed heartbeats consistent with the AM max-missed-heartbeats conf (#580)
Update TonY to include CII Best Practices badge (#576)
Introduce container allocation timeout (#575)
Clean up local tmp files in client (#574)
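Because #596 renames a configuration key, existing job configurations that still set `tony.worker.timeout` may need updating. The following is a minimal migration sketch using `java.util.Properties`, not TonY's own configuration classes. The helper class and method are hypothetical, and the sketch assumes the value carries over unchanged, so check the units against the TonY documentation before relying on it.

```java
import java.util.Properties;

// Hypothetical helper; only the two key names come from the changelog.
class TimeoutKeyMigrationSketch {
    static final String OLD_KEY = "tony.worker.timeout";
    static final String NEW_KEY = "tony.task.executor.execution-timeout-ms";

    // Returns a copy of conf in which the old key's value is moved to the new key.
    // A value already set under the new key takes precedence (assumption).
    static Properties migrate(Properties conf) {
        Properties migrated = new Properties();
        migrated.putAll(conf);
        String oldValue = migrated.getProperty(OLD_KEY);
        if (oldValue != null) {
            migrated.remove(OLD_KEY);
            migrated.putIfAbsent(NEW_KEY, oldValue);
        }
        return migrated;
    }
}
```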
Release TonY 0.4.7
Change list:
TonYClient ignores connection errors to prevent app failure when sending the AM stop signal (#522)
Remove AMRM credentials on task executor (#527)
Introduce generic interface to support multiple frameworks (#529)
Introduce new registerCallbackInfo rpc endpoint (#530)
Introduce standalone runtime type (#533)
Support horovod (#524)
[Runtime] Make job fast fail when conf is illegal (#535)
[Horovod-Runtime] Introduce custom horovod driver script in debug mode (#540)
Re-enable tensorboard port reuse (#541)
Prevent running containers from stopping when the AM crashes (#549)
Support sidecar tensorboard (#546)
Update jquery version to 3.5.0 (#556)
[Horovod] Using user-defined python exec path to start built-in Horovod driver (#555)
Implement the Comparable interface for TonyTask (#551)
Allow specifying a sidecar job type for a task so that its failure is ignored (#558)
Specify sidecar tensorboard with sidecar job type (#561)
Add estimator implementation for MNIST (#560)
Allow specifying extra startup options for sidecar TensorBoard (#564)
Separate the AM and TaskExecutor runtime interfaces (#562)
Throw an exception when the GPU resource is not found on the cluster (#565)
Introduce pluggable runtime provider (#566)
Compatible with Hadoop 2.6.0-cdh5.11.0 (#571)
Release TonY 0.4.6, including a number of enhancements
Change List:
403182d Make TonY client log layout more organized (#474)
8470968 [MINOR] Support specifying a timeout for the AM waiting for the client stop signal (#518)
a0e39ea [MINOR] Ignore connection errors when updating task info to prevent app failure (#517)
b3f96f1 When registrationTimeoutMS is below 0, the AM waits forever (#520)
f000db4 Fail fast when container launch fails and the job type is not in stop.on.failure.jobtypes (#516)
346b086 [MINOR] Set diagnostic message to YARN (#519)
06aef04 Reserve evaluator host spec in TF_CONFIG cluster, only when in evaluator process (#515)
c7407db Evaluator should be standalone with training cluster in TF (#512)
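The registration-timeout behavior in #520 (a negative `registrationTimeoutMS` means wait indefinitely) can be sketched roughly as follows. This is an illustration with a `CountDownLatch`, not the AM's actual implementation; the class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper illustrating "negative timeout means wait forever".
class RegistrationWaitSketch {
    // Returns true once all tasks have registered, false on timeout or interruption.
    static boolean awaitRegistration(CountDownLatch registered, long registrationTimeoutMs) {
        try {
            if (registrationTimeoutMs < 0) {
                registered.await(); // no deadline: block until every task has registered
                return true;
            }
            return registered.await(registrationTimeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status for the caller
            return false;
        }
    }
}
```

A caller would count the latch down once per registered task and treat a `false` return as a registration failure.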
Fix TonY requesting the YARN config "yarn.io/gpu" on non-GPU clusters
Fault tolerance to missing resource paths
Proceed without failing if container resource paths do not exist
If some container resource paths do not exist, TonY continues with the resource paths that are available instead of failing.
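The idea of skipping missing resource paths can be sketched as below. This is only an illustration against the local filesystem with `java.nio.file`; TonY itself works with Hadoop paths, and the class and method names here are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: keeps the paths that exist and reports the ones that do not.
class ResourcePathFilterSketch {
    static List<Path> existingPaths(List<Path> requested) {
        List<Path> available = new ArrayList<>();
        for (Path path : requested) {
            if (Files.exists(path)) {
                available.add(path);
            } else {
                // Log and continue rather than failing the whole job.
                System.err.println("Skipping missing resource path: " + path);
            }
        }
        return available;
    }
}
```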
Bug fixes
Bug fix: AM retry prints “Task was null! Nothing to schedule”
Fixed the AM retry printing “Task was null! Nothing to schedule”, which was caused by an accumulation of container requests.
Bump up hadoop version to 2.10.0
Bug fix: Exception in AM thread that causes TonY to hang
Handle the "NoSuchMethodError" exception in the AM thread caused by an incompatible Avro version.