Releases: ray-project/ray
Ray-2.11.0
Release Highlights
- [data] Support reading Avro files with `ray.data.read_avro`
- [train] Added experimental support for AWS Trainium (Neuron) and Intel HPU.
Ray Libraries
Ray Data
🎉 New Features:
- Support reading Avro files with `ray.data.read_avro` (#43663)
💫 Enhancements:
- Pin `ipywidgets==7.7.2` to enable Data progress bars in VSCode Web (#44398)
- Change log level for ignored exceptions (#44408)
🔨 Fixes:
- Change Parquet encoding ratio lower bound from 2 to 1 (#44470)
- Fix throughput time calculations for metrics (#44138)
- Fix nested ragged `numpy.ndarray` (#44236)
- Fix Ray debugger incompatibility caused by trimmed error stack trace (#44496)
📖 Documentation:
- Update "Data Loading and Preprocessing" doc (#44165)
- Move imports into `TFPredictor` in batch inference example (#44434)
Ray Train
🎉 New Features:
- Add experimental support for AWS Trainium (Neuron) (#39130)
- Add experimental support for Intel HPU (#43343)
💫 Enhancements:
- Log a deprecation warning for local_dir and related environment variables (#44029)
- Enforce xgboost>=1.7 for XGBoostTrainer usage (#44269)
🔨 Fixes:
- Fix ScalingConfig(accelerator_type) to request an appropriate resource amount (#44225)
- Fix maximum recursion issue when serializing exceptions (#43952)
- Remove base config deepcopy when initializing the trainer actor (#44611)
🏗 Architecture refactoring:
- Remove deprecated `BatchPredictor` (#43934)
Ray Tune
💫 Enhancements:
- Add support for new style lightning import (#44339)
- Log a deprecation warning for local_dir and related environment variables (#44029)
🏗 Architecture refactoring:
- Remove scikit-optimize search algorithm (#43969)
Ray Serve
🔨 Fixes:
- Dynamically-created applications will no longer be deleted when a config is PUT via the REST API (#44476).
- Fix `_to_object_ref` memory leak (#43763)
- Log a warning to reconfigure `max_ongoing_requests` if `max_batch_size` is less than `max_ongoing_requests` (#43840)
- Fix deployments failing to start with `ModuleNotFoundError` in Ray 2.10 (#44329)
  - This was fixed by reverting the original core changes to the `sys.path` behavior: Revert "[core] If there's working_dir, don't set _py_driver_sys_path." (#44435)
- The `batch_queue_cls` parameter is removed from the `@serve.batch` decorator (#43935)
RLlib
🎉 New Features:
- New API stack: DQN Rainbow is now available for single-agent setups (#43196, #43198, #43199)
- `PrioritizedEpisodeReplayBuffer` is available for off-policy learning using the EnvRunner API (`SingleAgentEnvRunner`) and supports random n-step sampling (#42832, #43258, #43458, #43496, #44262)
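The random n-step sampling mentioned above boils down to drawing a start index and a horizon n, then folding n rewards into a single bootstrapped target. A minimal pure-Python sketch of that target (illustrative only, not RLlib's `PrioritizedEpisodeReplayBuffer` code):

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99, n=3):
    """Discounted sum of the first n rewards, bootstrapped from the
    value estimate of the state reached after step n."""
    horizon = min(n, len(rewards))  # episodes can end before n steps
    g = sum((gamma ** k) * r for k, r in enumerate(rewards[:horizon]))
    return g + (gamma ** horizon) * bootstrap_value
```

Sampling a random n per transition (rather than a fixed one) just means drawing `n` before computing this target.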
💫 Enhancements:
- Restructured the `examples/` folder; started moving example scripts to the new API stack (#44559, #44067, #44603)
- Evaluation do-over: Deprecate the `enable_async_evaluation` option (in favor of the existing `evaluation_parallel_to_training` setting). (#43787)
- Add a `module_for` API to `MultiAgentEpisode` (analogous to the `policy_for` API of the old Episode classes). (#44241)
- All `rllib_contrib` old-stack algorithms have been removed from `rllib/algorithms` (#43656)
🔨 Fixes:
- New API stack: Multi-GPU + multi-agent has been fixed. This completes support for any combinations of the following on the new API stack: [single-agent, multi-agent] vs [0 GPUs, 1 GPU, >1GPUs] vs [any number of EnvRunners] (#44420, #44664, #44594, #44677, #44082, #44669, #44622)
- Various other bug fixes: #43906, #43871, #44000, #44340, #44491, #43959, #44043, #44446, #44040
Ray Core and Ray Clusters
🎉 New Features:
- Added Ray check-open-ports CLI for checking potential open ports to the public (#44488)
💫 Enhancements:
- Support nodes sharing the same spilling directory without conflicts. (#44487)
- Create two subclasses of `RayActorError` to distinguish between actor-died (`ActorDiedError`) and actor-temporarily-unavailable (`ActorUnavailableError`) cases.
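The practical upshot of the split is that handlers can distinguish a permanent death from a transient outage, while existing `except RayActorError` code keeps working. A small pure-Python sketch of that pattern (stand-in classes; the real ones live in `ray.exceptions`):

```python
# Stand-in exception hierarchy mirroring the relationship described
# above; not Ray's actual classes.
class RayActorError(Exception): ...
class ActorDiedError(RayActorError): ...
class ActorUnavailableError(RayActorError): ...

def handle(err):
    if isinstance(err, ActorDiedError):
        return "recreate"   # actor is permanently gone
    if isinstance(err, ActorUnavailableError):
        return "retry"      # transient; the actor may come back
    return "reraise"
```

Because both subclasses derive from `RayActorError`, a broad `except RayActorError` handler still catches both.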
🔨 Fixes:
- Fixed the `ModuleNotFoundError` issue introduced in 2.10 (#44435)
- Fixed an issue where the agent process was using too much CPU (#44348)
- Fixed race condition in multi-threaded actor creation (#44232)
- Fixed several streaming generator bugs (#44079, #44257, #44197)
- Fixed an issue where user exception raised from tasks cannot be subclassed (#44379)
Dashboard
💫 Enhancements:
- Add serve controller metrics to serve system dashboard page (#43797)
- Add Serve Application rows to Serve top-level deployments details page (#43506)
- [Actor table page enhancements] Include "NodeId", "CPU", "Memory", "GPU", "GRAM" columns in the actor table page. Add sort functionality to resource utilization columns. Enable searching table by "Class" and "Repr". (#42588) (#42633) (#42788)
🔨 Fixes:
- Fix default sorting of nodes in Cluster table page to first be by "Alive" nodes, then head nodes, then alphabetical by node ID. (#42929)
- Fix bug where the Serve Deployment detail page fails to load if the deployment is in "Starting" state (#43279)
Docs
💫 Enhancements:
- Landing page refreshes its look and feel. (#44251)
Thanks
Many thanks to all those who contributed to this release!
@aslonnie, @brycehuang30, @MortalHappiness, @astron8t-voyagerx, @edoakes, @sven1977, @anyscalesam, @scottjlee, @hongchaodeng, @slfan1989, @hebiao064, @fishbone, @zcin, @GeneDer, @shrekris-anyscale, @kira-lin, @chappidim, @raulchen, @c21, @WeichenXu123, @marian-code, @bveeramani, @can-anyscale, @mjd3, @justinvyu, @jackhumphries, @Bye-legumes, @ashione, @alanwguo, @Dreamsorcerer, @KamenShah, @jjyao, @omatthew98, @autolisis, @Superskyyy, @stephanie-wang, @simonsays1980, @davidxia, @angelinalg, @architkulkarni, @chris-ray-zhang, @kevin85421, @rynewang, @peytondmurray, @zhangyilun, @khluu, @matthewdeng, @ruisearch42, @pcmoritz, @mattip, @jerome-habana, @alexeykudinkin
Ray-2.10.0
Release Highlights
Ray 2.10 release brings important stability improvements and enhancements to Ray Data, with Ray Data becoming generally available (GA).
- [Data] Ray Data becomes generally available with stability improvements in streaming execution, reading and writing data, better tasks concurrency control, and debuggability improvement with dashboard, logging and metrics visualization.
- [RLlib] “New API Stack” officially announced as alpha for PPO and SAC.
- [Serve] Added a default autoscaling policy set via `num_replicas="auto"` (#42613).
- [Serve] Added support for active load shedding via `max_queued_requests` (#42950).
- [Serve] Added replica queue length caching to the DeploymentHandle scheduler (#42943).
  - This should improve overhead in the Serve proxy and handles. `max_ongoing_requests` (`max_concurrent_queries`) is also now strictly enforced (#42947).
  - If you see any issues, please report them on GitHub; you can disable this behavior by setting `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0`.
- [Serve] Renamed the following parameters. Each of the old names will be supported for another release before removal.
  - `max_concurrent_queries` -> `max_ongoing_requests`
  - `target_num_ongoing_requests_per_replica` -> `target_ongoing_requests`
  - `downscale_smoothing_factor` -> `downscaling_factor`
  - `upscale_smoothing_factor` -> `upscaling_factor`
- [Core] Autoscaler v2 is in alpha and can be tried out with KubeRay. It has improved observability and stability compared to v1.
- [Train] Added support for accelerator types via `ScalingConfig(accelerator_type)`.
- [Train] Revamped `XGBoostTrainer` and `LightGBMTrainer` to no longer depend on `xgboost_ray` and `lightgbm_ray`. A new, more flexible API will be released in a future release.
- [Train/Tune] Refactored the local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`.
Ray Libraries
Ray Data
🎉 New Features:
- Streaming execution stability improvement to avoid memory issue, including per-operator resource reservation, streaming generator output buffer management, and better runtime resource estimation (#43026, #43171, #43298, #43299, #42930, #42504)
- Metadata read stability improvement to avoid AWS transient error, including retry on application-level exception, spread tasks across multiple nodes, and configure retry interval (#42044, #43216, #42922, #42759).
- Allow tasks concurrency control for read, map, and write APIs (#42849, #43113, #43177, #42637)
- Data dashboard and statistics improvements with more runtime metrics for each component (#43790, #43628, #43241, #43477, #43110, #43112)
- Allow specifying application-level errors to retry for actor tasks (#42492)
- Add a `num_rows_per_file` parameter to file-based writes (#42694)
- Add `DataIterator.materialize` (#43210)
- Skip the schema call in `DataIterator.to_tf` if `tf.TypeSpec` is provided (#42917)
- Add an option to append for `Dataset.write_bigquery` (#42584)
- Deprecate legacy components and classes (#43575, #43178, #43347, #43349, #43342, #43341, #42936, #43144, #43022, #43023)
💫 Enhancements:
- Restructure stdout logging for better readability (#43360)
- Add a more performant way to read large TFRecord datasets (#42277)
- Modify `ImageDatasource` to use `Image.BILINEAR` as the default image resampling filter (#43484)
- Reduce internal stack trace output by default (#43251)
- Perform incremental writes to Parquet files (#43563)
- Warn on excessive driver memory usage during shuffle ops (#42574)
- Distributed reads for `ray.data.from_huggingface` (#42599)
- Remove the `Stage` class and related usages (#42685)
- Improve stability of reading JSON files to avoid PyArrow errors (#42558, #42357)
🔨 Fixes:
- Turn off actor locality by default (#44124)
- Normalize block types before internal multi-block operations (#43764)
- Fix memory metrics for `OutputSplitter` (#43740)
- Fix a race condition in `OpBufferQueue` (#43015)
- Fix early stop for multiple `Limit` operators (#42958)
- Fix deadlocks caused by `Dataset.streaming_split` that made jobs hang (#42601)
Ray Train
🎉 New Features:
- Add support for accelerator types via `ScalingConfig(accelerator_type)` for improved worker scheduling (#43090)
💫 Enhancements:
- Add a backend-specific context manager for `train_func` for setup/teardown logic (#43209)
- Remove `DEFAULT_NCCL_SOCKET_IFNAME` to simplify network configuration (#42808)
- Colocate the Trainer with the rank 0 Worker to improve scheduling behavior (#43115)
🔨 Fixes:
- Enable scheduling workers with `memory` resource requirements (#42999)
- Make path behavior OS-agnostic by using `Path.as_posix` over `os.path.join` (#42037)
- [Lightning] Fix resuming from checkpoint when using `RayFSDPStrategy` (#43594)
- [Lightning] Fix deadlock in `RayTrainReportCallback` (#42751)
- [Transformers] Fix checkpoint reporting behavior when `get_latest_checkpoint` returns None (#42953)
📖 Documentation:
- Enhance docstring and user guides for `train_loop_config` (#43691)
- Clarify in the `ray.train.report` docstring that it is not a barrier (#42422)
- Improve documentation for `prepare_data_loader` shuffle behavior and `set_epoch` (#41807)
🏗 Architecture refactoring:
- Simplify XGBoost and LightGBM Trainer integrations. Implemented `XGBoostTrainer` and `LightGBMTrainer` as `DataParallelTrainer`s. Removed the dependency on `xgboost_ray` and `lightgbm_ray`. (#42111, #42767, #43244, #43424)
- Refactor the local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to `storage_path`, rather than having another copy in the user's home directory (`~/ray_results`). (#43369, #43403, #43689)
- Split the overloaded `ray.train.torch.get_device` into a separate `get_devices` API for multi-GPU worker setup (#42314)
- Refactor restoration configuration to be centered around `storage_path` (#42853, #43179)
- Deprecations related to `SyncConfig` (#42909)
- Remove the deprecated `preprocessor` argument from Trainers (#43146, #43234)
- Hard-deprecate `MosaicTrainer` and remove `SklearnTrainer` (#42814)
Ray Tune
💫 Enhancements:
- Increase the minimum number of allowed pending trials for faster auto-scaleup (#43455)
- Add support to `TBXLogger` for logging images (#37822)
- Improve validation of `Experiment(config)` to handle RLlib `AlgorithmConfig` (#42816, #42116)
🔨 Fixes:
- Fix `reuse_actors` error on actor cleanup for function trainables (#42951)
- Make path behavior OS-agnostic by using `Path.as_posix` over `os.path.join` (#42037)
🏗 Architecture refactoring:
- Refactor the local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to `storage_path`, rather than having another copy in the user's home directory (`~/ray_results`). (#43369, #43403, #43689)
- Deprecations related to `SyncConfig` and `chdir_to_trial_dir` (#42909)
- Refactor restoration configuration to be centered around `storage_path` (#42853, #43179)
- Add back `NevergradSearch` (#42305)
- Clean up invalid `checkpoint_dir` and `reporter` deprecation notices (#42698)
Ray Serve
🎉 New Features:
- Added support for active load shedding via `max_queued_requests` (#42950).
- Added a default autoscaling policy set via `num_replicas="auto"` (#42613).
🏗 API Changes:
- Renamed the following parameters. Each of the old names will be supported for another release before removal.
  - `max_concurrent_queries` to `max_ongoing_requests`
  - `target_num_ongoing_requests_per_replica` to `target_ongoing_requests`
  - `downscale_smoothing_factor` to `downscaling_factor`
  - `upscale_smoothing_factor` to `upscaling_factor`
- WARNING: the following default values will change in Ray 2.11:
  - Default for `max_ongoing_requests` will change from 100 to 5.
  - Default for `target_ongoing_requests` will change from 1 to 2.
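Migrating to the new names is mechanical. A hypothetical helper for illustration (the mapping mirrors the renames above; the function itself is not a Serve API):

```python
# Old deployment/autoscaling option names -> new names (per the
# rename list above). `upgrade_options` is a hypothetical helper.
RENAMES = {
    "max_concurrent_queries": "max_ongoing_requests",
    "target_num_ongoing_requests_per_replica": "target_ongoing_requests",
    "downscale_smoothing_factor": "downscaling_factor",
    "upscale_smoothing_factor": "upscaling_factor",
}

def upgrade_options(options: dict) -> dict:
    """Return a copy of an options dict with old names replaced."""
    return {RENAMES.get(k, k): v for k, v in options.items()}
```

For example, `upgrade_options({"max_concurrent_queries": 10, "num_replicas": 2})` yields `{"max_ongoing_requests": 10, "num_replicas": 2}`.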
💫 Enhancements:
- Add a `RAY_SERVE_LOG_ENCODING` env var to set the global logging behavior for Serve (#42781).
- Configure Serve's gRPC proxy to allow large payloads (#43114).
- Add a blocking flag to `serve.run()` (#43227).
- Add actor ID and worker ID to Serve structured logs (#43725).
- Added replica queue length caching to the DeploymentHandle scheduler (#42943).
  - This should improve overhead in the Serve proxy and handles. `max_ongoing_requests` (`max_concurrent_queries`) is also now strictly enforced (#42947).
  - If you see any issues, please report them on GitHub; you can disable this behavior by setting `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0`.
- Autoscaling metrics (trackin...
Ray-2.9.3
This patch release contains fixes for Ray Core, Ray Data, and Ray Serve.
Ray Core
🔨 Fixes:
- Fix protobuf breaking change by adding a compat layer. (#43172)
- Bump up task failure logs to warnings so that failures can be troubleshot (#43147)
- Fix placement group leaks (#42942)
Ray Data
🔨 Fixes:
- Skip the `schema` call in `to_tf` if `tf.TypeSpec` is provided (#42917)
- Skip recording memory spilled stats when `get_memory_info_reply` fails (#42824)
Ray Serve
🔨 Fixes:
- Fix `DeploymentStateManager` qualifying replicas as running prematurely (#43075)
Thanks
Many thanks to all those who contributed to this release!
@rynewang, @GeneDer, @alexeykudinkin, @edoakes, @c21, @rkooo567
Ray-2.9.2
This patch release contains fixes for Ray Core, Ray Data, and Ray Serve.
Ray Core
🔨 Fixes:
- Fix out of disk test on release branch (#42724)
Ray Data
🔨 Fixes:
- Fix failing huggingface test (#42727)
- Fix deadlocks caused by streaming_split (#42601) (#42755)
- Fix locality config not being respected in DataConfig (#42204) (#42722)
- Stability & accuracy improvements for the Data+Train benchmark (#42027)
- Add retry for `_sample_fragment` during `ParquetDatasource._estimate_files_encoding_ratio()` (#42759) (#42774)
- Skip recording memory spilled stats when `get_memory_info_reply` fails (#42824) (#42834)
Ray Serve
🔨 Fixes:
- Pin the fastapi & starlette versions to avoid breaking the proxy (#42740)
- Fix `IS_PYDANTIC_2` logic for pydantic<1.9.0 (#42704) (#42708)
- Fix missing message body for JSON log formats (#42729) (#42874)
Thanks
Many thanks to all those who contributed to this release!
@c21, @raulchen, @can-anyscale, @edoakes, @peytondmurray, @scottjlee, @aslonnie, @architkulkarni, @GeneDer, @Zandew, @sihanwang41
Ray-2.9.1
This patch release contains fixes for Ray Core, Ray Data, and Ray Serve.
Ray Core
🔨 Fixes:
- Add debugpy as the Ray debugger (#42311)
- Fix task events profile events per task leak (#42248)
- Make sure redis sync context and async context connect to the same redis instance (#42040)
Ray Data
🔨 Fixes:
- [Data] Retry write if error during file clean up (#42326)
Ray-2.9.0
Release Highlights
- This release contains fixes for the Ray Dashboard. Additional context can be found here: https://www.anyscale.com/blog/update-on-ray-cves-cve-2023-6019-cve-2023-6020-cve-2023-6021-cve-2023-48022-cve-2023-48023
- Ray Train has now upgraded support for spot node preemption -- allowing Ray Train to handle preemption node failures differently than application errors.
- Ray is now compatible with Pydantic versions <2.0.0 and >=2.5.0, addressing a piece of user feedback we’ve consistently received.
- The Ray Dashboard now has a page for Ray Data to monitor real-time execution metrics.
- The streaming generator is now officially a public API (#41436, #38784). The streaming generator allows writing streaming applications easily on top of Ray via the Python generator API and has been used by Ray Serve and Ray Data for several releases. See the documentation for details.
- We’ve added experimental support for new accelerators: Intel GPU (#38553), Intel Gaudi Accelerators (#40561), and Huawei Ascend NPU (#41256).
Ray Libraries
Ray Data
🎉 New Features:
- Add the dashboard for Ray Data to monitor real-time execution metrics and log file for debugging (https://docs.ray.io/en/master/data/monitoring-your-workload.html).
- Introduce a `concurrency` argument to replace `ComputeStrategy` in map-like APIs (#41461)
- Allow task failures during execution (#41226)
- Support PyArrow 14.0.1 (#41036)
- Add a new API for reading and writing Datasources (#40296)
- Enable group-by over multiple keys in datasets (#37832)
- Add support for multiple group keys in `map_groups` (#40778)
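Multi-key group-by treats the tuple of key columns as one composite key. A pure-Python sketch of those semantics (illustrative only, not Ray Data's implementation):

```python
from collections import defaultdict

def group_by_keys(rows, keys):
    """Group dict-rows by a tuple of key columns (a composite key)."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return dict(groups)

rows = [
    {"a": 1, "b": "x", "v": 10},
    {"a": 1, "b": "y", "v": 20},
    {"a": 1, "b": "x", "v": 30},
]
grouped = group_by_keys(rows, ["a", "b"])
# rows sharing both "a" and "b" land in the same group
```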
💫 Enhancements:
- Optimize `OpState.outqueue_num_blocks` (#41748)
- Improve stall detection for `StreamingOutputsBackpressurePolicy` (#41637)
- Enable read-only Datasets to be executed on the new execution backend (#41466, #41597)
- Inherit block size from downstream ops (#41019)
- Use runtime object memory for scheduling (#41383)
- Add retries to file writes (#41263)
- Make range datasource streaming (#41302)
- Test core performance metrics (#40757)
- Allow `ConcurrencyCapBackpressurePolicy._cap_multiplier` to be set to 1.0 (#41222)
- Create `StatsManager` to manage `_StatsActor` remote calls (#40913)
- Expose a `max_retry_cnt` parameter for BigQuery writes (#41163)
- Add rows outputted to data metrics (#40280)
- Add fault tolerance to remote tasks (#41084)
- Add operator-level dropdown to ray data overview (#40981)
- Avoid slicing too-small blocks (#40840)
- Ray Data jobs detail table (#40756)
- Update default shuffle block size to 1GB (#40839)
- Log progress bar to data logs (#40814)
- Operator level metrics (#40805)
🔨 Fixes:
- Partial fix for `Dataset.context` not being sealed after creation (#41569)
- Fix the issue that `DataContext` is not propagated when using `streaming_split` (#41473)
- Fix Parquet partition filter bug (#40947)
- Fix split read output blocks (#41070)
- Fix `BigQueryDatasource` fault tolerance bugs (#40986)
📖 Documentation:
- Add example of how to read and write custom file types (#41785)
- Fix `ray.data.read_databricks_tables` doc (#41366)
- Add a `read_json` docs example for setting the PyArrow block size when reading large files (#40533)
- Add `AllToAllAPI` to dataset methods (#40842)
Ray Train
🎉 New Features:
- Support reading `Result` from cloud storage (#40622)
💫 Enhancements:
- Sort local Train workers by GPU ID (#40953)
- Improve logging for Train worker scheduling information (#40536)
- Load the latest unflattened metrics with `Result.from_path` (#40684)
- Skip incrementing the failure counter on preemption node died failures (#41285)
- Update the TensorFlow `ReportCheckpointCallback` to delete its temporary directory (#41033)
🔨 Fixes:
- Update config dataclass repr to check against None (#40851)
- Add a barrier in the Lightning `RayTrainReportCallback` to ensure synchronous reporting. (#40875)
- Restore the Tuner and `Result`s properly from a moved storage path (#40647)
📖 Documentation:
- Improve torch, lightning quickstarts and migration guides + fix torch restoration example (#41843)
- Clarify error message when trying to use local storage for multi-node distributed training and checkpointing (#41844)
- Copy edits and adding links to docstrings (#39617)
- Fix the missing ray module import in PyTorch Guide (#41300)
- Fix typo in lightning_mnist_example.ipynb (#40577)
- Fix typo in deepspeed.rst (#40320)
🏗 Architecture refactoring:
- Remove Legacy Trainers (#41276)
Ray Tune
🎉 New Features:
- Support reading `Result` from cloud storage (#40622)
💫 Enhancements:
- Skip incrementing failure counter on preemption node died failures (#41285)
🔨 Fixes:
- Restore the Tuner and `Result`s properly from a moved storage path (#40647)
📖 Documentation:
- Remove low value Tune examples and references to them (#41348)
- Clarify when to use `MLflowLoggerCallback` and `setup_mlflow` (#37854)
🏗 Architecture refactoring:
- Delete legacy `TuneClient`/`TuneServer` APIs (#41469)
- Delete legacy `Searcher`s (#41414)
- Delete legacy persistence utilities (`air.remote_storage`, etc.) (#40207)
Ray Serve
🎉 New Features:
- Introduce logging config so that users can set different logging parameters for different applications & deployments.
- Added gRPC context object into gRPC deployments for user to set custom code and details back to the client.
- Introduce a runtime environment feature that allows running applications in different containers with different images. This feature is experimental and a new guide can be found in the Serve docs.
💫 Enhancements:
- Explicitly handle gRPC proxy task cancellation when the client dropped a request to not waste compute resources.
- Enable async `__del__` in deployments to execute custom cleanup steps.
- Make Ray Serve compatible with Pydantic versions <2.0.0 and >=2.5.0.
🔨 Fixes:
- Fixed gRPC proxy streaming request latency metrics to include the entire lifecycle of the request, including the time to consume the generator.
- Fixed gRPC proxy timeout request status from CANCELLED to DEADLINE_EXCEEDED.
- Fixed Serve shutdown spamming log files with a log line for each event loop; it now logs only once on shutdown.
- Fixed an issue with batched requests where, if a request was dropped, the batch loop would be killed and would not process any future requests.
- Updated replica log filenames to only include POSIX-compliant characters (removed the "#" character).
- Replicas will now be gracefully shut down after being marked unhealthy due to health check failures instead of being force killed.
  - This behavior can be toggled using the environment variable RAY_SERVE_FORCE_STOP_UNHEALTHY_REPLICAS=1, but it is planned to be removed in the near future. If you rely on this behavior, please file an issue on GitHub.
RLlib
🎉 New Features:
- New API stack (in progress):
  - New `MultiAgentEpisode` class introduced. Basis for the upcoming multi-agent EnvRunner, which will replace the RolloutWorker APIs. (#40263, #40799)
  - PPO runs with the new `SingleAgentEnvRunner` (w/o Policy/RolloutWorker APIs). CI learning tests added. (#39732, #41074, #41075)
  - By default: PPO reverted to the old API stack for now, pending feature-completion of the new API stack (incl. multi-agent, RNN support, new EnvRunners, etc.). (#40706)
🔨 Fixes:
- Restoring from a checkpoint created with an older wheel (where `AlgorithmConfig.rl_module_spec` was NOT a `@property` yet) broke when trying to load from that checkpoint. (#41157)
- SampleBatch slicing crashed when using tf + SEQ_LENS + zero-padding. (#40905)
- Other fixes: #39978, #40788, #41168, #41204
📖 Documentation:
- Updated codeblocks in RLlib. (#37271)
Ray Core and Ray Clusters
Ray Core
🎉 New Features:
- The streaming generator is now officially a public API (#41436, #38784). The streaming generator allows writing streaming applications easily on top of Ray via the Python generator API and has been used by Ray Serve and Ray Data for several releases. See the documentation for details.
  - As part of the change, `num_returns="dynamic"` is planned to be deprecated, and its return type is changed from `ObjectRefGenerator` to `DynamicObjectRefGenerator`.
- Add experimental accelerator support for new hardware.
- Add initial support for running MPI-based code on top of Ray. (#40917, #41349)
💫 Enhancements:
- Optimize next/anext performance for streaming generator (#41270)
- Make the number of connections and thread number of the object manager client tunable. (#41421)
- Add a `__ray_call__` default actor method (#41534)
Ray-2.8.1
Release Highlights
The Ray 2.8.1 patch release contains fixes for the Ray Dashboard.
Additional context can be found here: https://www.anyscale.com/blog/update-on-ray-cves-cve-2023-6019-cve-2023-6020-cve-2023-6021-cve-2023-48022-cve-2023-48023
Ray Dashboard
🔨 Fixes:
- [core][state][log] Cherry-pick changes to prevent the state API from reading files outside the Ray log directory (#41520)
- [Dashboard] Migrate Logs page to use state API. (#41474) (#41522)
Ray-2.8.0
Release Highlights
This release features stability improvements and API clean-ups across the Ray libraries.
- In Ray Serve, we are deprecating the previously experimental DAG API for deployment graphs. Model composition will be supported through deployment handles, providing more flexibility and stability. The previously deprecated Ray Serve 1.x APIs have also been removed. We've also added a new Java API that aligns with the Ray Serve 2.x APIs. More API changes in the release notes below.
- In RLlib, we’ve moved 24 algorithms into `rllib_contrib` (still available within RLlib for Ray 2.8).
- We’ve added support for PyTorch-compatible input file shuffling for Ray Data. This allows users to randomly shuffle input files for better model training accuracy. This release also features new Ray Data datasources for Databricks and BigQuery.
- On the Ray Dashboard, we’ve added new metrics for Ray Data in the Metrics tab. This allows users to monitor Ray Data workload including real time metrics of cluster memory, CPU, GPU, output data size, etc. See the doc for more details.
- Ray Core now supports profiling GPU tasks or actors using Nvidia Nsight. See the documentation for instructions.
- We fixed 2 critical bugs raised by many KubeRay / ML library users, including a child-process leak from Ray workers that leaked GPU memory (#40182) and an excessive job-page loading time when a Ray HA cluster restarts a head node (#40742)
- Python 3.7 support is officially deprecated from Ray.
Ray Libraries
Ray Data
🎉 New Features:
- Add support for shuffling input files (#40154)
- Support streaming read of PyTorch dataset (#39554)
- Add BigQuery datasource (#37380)
- Add Databricks table / SQL datasource (#39852)
- Add inverse transform functionality to LabelEncoder (#37785)
- Add function arg params to `Dataset.map` and `Dataset.flat_map` (#40010)
💫Enhancements:
- Hard deprecate `DatasetPipeline` (#40129)
- Remove the `BulkExecutor` code path (#40200)
- Deprecate extraneous `Dataset` parameters and methods (#40385)
- Remove the legacy iteration code path (#40013)
- Implement streaming output backpressure (#40387)
- Cap op concurrency with exponential ramp-up (#40275)
- Store Ray dashboard metrics in `_StatsActor` (#40118)
- Slice output blocks to respect the target block size (#40248)
- Drop columns before grouping by in `Dataset.unique()` (#40016)
- Standardize physical operator runtime metrics (#40173)
- Estimate blocks for limit and union operators (#40072)
- Store bytes spilled/restored after plan execution (#39361)
- Optimize `sample_boundaries` in `SortTaskSpec` (#39581)
- Optimization to reduce ArrowBlock building time for blocks of size 1 (#38833)
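The exponential ramp-up of the concurrency cap can be pictured as a limit that starts small and doubles each step until it reaches the configured maximum. A pure-Python sketch of the idea (illustrative only, not Ray Data's backpressure policy code):

```python
def ramped_caps(start, maximum, steps):
    """Yield a concurrency cap that doubles each step up to `maximum`."""
    cap = start
    for _ in range(steps):
        yield cap
        cap = min(cap * 2, maximum)

list(ramped_caps(1, 16, 6))  # [1, 2, 4, 8, 16, 16]
```

Starting small keeps resource usage conservative while the runtime gathers real usage data; doubling reaches the steady-state cap quickly once the workload proves stable.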
🔨 Fixes:
- Fix bug where `_StatsActor` errors with `PandasBlock` (#40481)
- Remove deprecated `do_write` (#40422)
- Improve the error message when reading HTTP files (#40462)
- Add a flag to skip `get_object_locations` for metrics (#39884)
- Fall back to fetching file info in parallel for multiple directories (#39592)
- Replace deprecated `.pieces` with the updated `.fragments` (#39523)
- Backwards compatibility for `Preprocessor`s that have been fit in older versions (#39173)
- Remove unnecessary data copy in `convert_udf_returns_to_numpy` (#39188)
- Do not eagerly free root `RefBundles` (#39016)
Ray Train
🎉 New Features:
- Add initial support for scheduling workers on neuron_cores (#39091)
💫Enhancements:
- Update the PyTorch Lightning import path to support both `pytorch_lightning` and `lightning` (#39841, #40266)
- Propagate the driver `DataContext` to `RayTrainWorkers` (#40116)
🔨 Fixes:
- Fix error propagation for as_directory if to_directory fails (#40025)
📖Documentation:
- Update checkpoint hierarchy documentation for RayTrainReportCallbacks. (#40174)
- Update Lightning RayDDPStrategy docstring (#40376)
🏗 Architecture refactoring:
- Deprecate `LightningTrainer`, `AccelerateTrainer`, and `TransformersTrainer` (#40163)
- Clean up legacy persistence mode code paths (#39921, #40061, #40069, #40168)
- Deprecate legacy `DatasetConfig` (#39963)
- Remove references to `DatasetPipeline` (#40159)
- Enable isort (#40172)
Ray Tune
💫Enhancements:
- Separate storage checkpoint index bookkeeping (#39927, #40003)
- Raise an error if `Tuner.restore()` is called on an instance (#39676)
🏗 Architecture refactoring:
- Clean up legacy persistence mode code paths (#39918, #40061, #40069, #40168, #40175, #40192, #40181, #40193)
- Migrate TuneController tests (#39704)
- Remove TuneRichReporter (#40169)
- Remove legacy Ray Client tests (#40415)
Ray Serve
💫Enhancements:
- The single-app configuration format for the Serve Config (i.e. the Serve Config without the ‘applications’ field) has been deprecated in favor of the new configuration format. Both the single-app configuration and the DAG API will be removed in 2.9.
- The Serve REST API is now accessible through the dashboard port, which defaults to `8265`.
- Accessing the Serve REST API through the dashboard agent port (default `52365`) is deprecated. The support will be removed in a future version.
- Ray job error tracebacks are now logged in the job driver log for easier access when jobs fail during startup.
- Deprecated the single-application config file
- Deprecated the DAG API: `InputNode` and `DAGDriver`
- Removed deprecated Deployment 1.x APIs: `Deployment.deploy()`, `Deployment.delete()`, `Deployment.get_handle()`
- Removed deprecated 1.x APIs: `serve.get_deployment` and `serve.list_deployments`
- New Java API supported (aligns with the Ray Serve 2.x API)
🔨 Fixes:
- The `dedicated_cpu` and `detached` options in `serve.start()` have been fully disallowed.
- An error is now raised early when users pass invalid gRPC service functions.
- The proxy’s readiness check now uses a linear backoff to avoid getting stuck in an infinite loop if it takes longer than usual to start.
- `grpc_options` on `serve.start()` only allowed a `gRPCOptions` object in Ray 2.7.0. Dictionaries are now allowed to be used as `grpc_options` in the `serve.start()` call.
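A linear backoff simply grows the sleep between probes arithmetically, so a slow-starting proxy is polled less and less aggressively instead of spinning in a tight loop. A pure-Python sketch of the pattern (illustrative only, not Serve's actual readiness-check code; names are made up):

```python
import time

def wait_until_ready(probe, base_delay=0.1, max_attempts=10):
    """Retry `probe` with linearly growing sleeps: base, 2*base, 3*base, ..."""
    for attempt in range(1, max_attempts + 1):
        if probe():
            return attempt  # number of probes it took
        time.sleep(base_delay * attempt)
    raise TimeoutError("never became ready")

# Succeeds on the third probe.
calls = iter([False, False, True])
wait_until_ready(lambda: next(calls), base_delay=0.0)
```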
RLlib
💫Enhancements:
- `rllib_contrib` algorithms (A2C, A3C, AlphaStar #36584, AlphaZero #36736, ApexDDPG #36596, ApexDQN #36591, ARS #36607, Bandits #36612, CRR #36616, DDPG, DDPPO #36620, Dreamer(V1), DT #36623, ES #36625, LeelaChessZero #36627, MA-DDPG #36628, MAML, MB-MPO #36662, PG #36666, QMix #36682, R2D2, SimpleQ #36688, SlateQ #36710, and TD3 #36726) all produce warnings now if used. See here for more information on the `rllib_contrib` efforts.
- Provide a msgpack checkpoint translation utility to convert checkpoints into msgpack format, to be able to move between Python versions (#38825).
🔨 Fixes:
- Issue 35440 (JSON output writer should include INFOS #39632)
- Issue 39453 (PettingZoo wrappers should use correct multi-agent dict spaces #39459)
- Issue 39421 (Multi-discrete action spaces not supported in new stack #39534)
- Issue 39234 (Multi-categorical distribution bug #39464)
- Additional fixes: #39654, #35975, #39552, #38555
Ray Core and Ray Clusters
Ray Core
🎉 New Features:
- Python 3.7 support is officially deprecated in Ray.
- Supports profiling GPU tasks and actors using NVIDIA Nsight. See the documentation for instructions.
- Ray-on-Spark autoscaling is officially supported starting from Ray 2.8. See the REP for more details.
💫 Enhancements:
- Detailed IDLE node information is available from `ray status -v` (#39638)
- Adding a new accelerator to Ray is simplified with a new accelerator interface. See the in-flight REP for more details (#40286).
- `typing_extensions` was removed from the dependency requirements because Python 3.7 support is deprecated. (#40336)
- The Ray state API supports case-insensitive matching. (#34577)
- `ray start --runtime-env-agent-port` is officially supported. (#39919)
- The driver exit code is available from job info (#39675)
🔨 Fixes:
- Fixed a worker leak when Ray is used with placement groups, because Ray didn’t handle SIGTERM properly (#40182)
- Fixed an issue where the job page takes a very long time to load when a Ray HA cluster restarts a head node (#40431)
- [core] loosen the check on release object (#39570)
- [Core] ray init sigterm (#39816)
- [Core] Non Unit Instance fractional value fix (#39293)
- [Core]: Enable get_actor_name for actor runtime context (#39347)
- [core][streaming][python] Fix asyncio.wait coroutines args deprecated warnings (#40292)
📖 Documentation:
- The Ray streaming generator doc (alpha) is officially available at https://docs.ray.io/en/master/ray-core/ray-generator.html
Ray Clusters
💫 Enhancements:
- Enable GPU support for vSphere cluster launcher (#40667)
📖 Documentation:
- Setup RBAC by KubeRay Helm chart
- KubeRay upgrade documentation
- RayService high availability
Dashboard
🎉 New Features:
- New metrics for Ray Data can be found in the Metrics tab.
🔨 Fixes:
- Fixed a bug where the download log button did not download all logs for actors.
Thanks
Many thanks to all who contribute...
Ray-2.7.1
Release Highlights
- Ray Serve:
- Added an `application` tag to the `ray_serve_num_http_error_requests` metric
- Fixed a bug where no data shows up on the `Error QPS per Application` panel in the Ray Dashboard
- RLlib:
- DreamerV3: Bug fix enabling support for continuous actions.
- Ray Train:
- Fix a bug where setting a local storage path on Windows errors (#39951)
- Ray Tune:
- Fix a broken `Trial.node_ip` property (#40028)
- Ray Core:
- Fixed a segfault when a streaming generator and actor cancellation are used together
- Fixed the autoscaler SDK accidentally initializing a Ray worker, which led to a leaked driver showing up in the dashboard.
- Added a new user guide and fixes for the vSphere cluster launcher.
- Fixed a bug where `ray start` would occasionally fail with `ValueError: acceleratorType should match v(generation)-(cores/chips).`
- Dashboard:
- Improvements to the cluster page UI
- Fixed a bug where the overview page UI would crash
Ray Libraries
Ray Serve
🔨 Fixes:
- Fixed a bug where no data shows up on the `Error QPS per Application` panel in the Ray Dashboard
RLlib
🔨 Fixes:
- DreamerV3: Bug fix enabling support for continuous actions (#39751).
Ray Core and Ray Clusters
🔨 Fixes:
- Fixed Ray cluster stability in high-latency environments
Thanks
Many thanks to all those who contributed to this release!
@chaowanggg, @allenwang28, @shrekris-anyscale, @GeneDer, @justinvyu, @can-anyscale, @edoakes, @architkulkarni, @rkooo567, @rynewang, @rickyyx, @sven1977
Ray-2.7.0
Release Highlights
The Ray 2.7 release brings important stability improvements and enhancements to Ray libraries, with Ray Train and Ray Serve becoming generally available. Ray 2.7 is accompanied by a GA release of KubeRay.
- Following user feedback, we are rebranding “Ray AI Runtime (AIR)” to “Ray AI Libraries”. Without reducing any of the underlying functionality of the original Ray AI runtime vision as put forth in Ray 2.0, the underlying namespace (ray.air) is consolidated into ray.data, ray.train, and ray.tune. This change reduces the friction for new machine learning (ML) practitioners to quickly understand and leverage Ray for their production machine learning use cases.
- With this release, Ray Serve and Ray Train’s Pytorch support are becoming Generally Available -- indicating that the core APIs have been marked stable and that both libraries have undergone significant production hardening.
- In Ray Serve, we are introducing a new backwards-compatible `DeploymentHandle` API to unify the various existing handle APIs, a high-performance gRPC proxy to serve gRPC requests through Ray Serve, along with various stability and usability improvements.
- In Ray Train, we are consolidating the various PyTorch-based trainers into the TorchTrainer, reducing the amount of refactoring work new users need to scale existing training scripts. We are also introducing a new `train.Checkpoint` API, which provides a consolidated way of interacting with remote and local storage, along with various stability and usability improvements.
- In Ray Core, we’ve added initial integrations with TPUs and AWS accelerators, enabling Ray to natively detect these devices and schedule tasks/actors onto them. Ray Core also officially now supports actor task cancellation and has an experimental streaming generator that supports streaming response to the caller.
Take a look at our refreshed documentation and the Ray 2.7 migration guide and let us know your feedback!
Ray Libraries
Ray AIR
🏗 Architecture refactoring:
- Ray AIR namespace: We are sunsetting the "Ray AIR" concept and namespace (#39516, #38632, #38338, #38379, #37123, #36706, #37457, #36912, #37742, #37792, #37023). The changes follow the proposal outlined in this REP.
- Ray Train Preprocessors, Predictors: We now recommend using Ray Data instead of Preprocessors (#38348, #38518, #38640, #38866) and Predictors (#38209).
Ray Data
🎉 New Features:
- In this release, we’ve integrated the Ray Core streaming generator API by default, which allows us to reduce memory footprint throughout the data pipeline (#37736).
- Avoid unnecessary data buffering between `Read` and `Map` operators (zero-copy fusion) (#38789)
- Add `Dataset.write_images` to write images (#38228)
- Add `Dataset.write_sql()` to write to SQL databases (#38544)
- Support sort on multiple keys (#37124)
- Support reading and writing the JSONL file format (#37637)
- Support class constructor args for `Dataset.map()` and `flat_map()` (#38606)
- Implement streamed read from Hugging Face Dataset (#38432)
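The class-constructor-args feature can be sketched as follows; the `AddOffset` class and its arguments are illustrative, not part of the release, and the `ActorPoolStrategy` usage is an assumption about the 2.7-era API:

```python
# Sketch: class-based UDF with constructor args for Dataset.map (#38606).
# AddOffset is illustrative; the ActorPoolStrategy usage is an assumption
# about the 2.7-era API.
class AddOffset:
    def __init__(self, offset: int):
        self.offset = offset

    def __call__(self, row: dict) -> dict:
        # Shift the "id" column of each row by the configured offset.
        row["id"] += self.offset
        return row

if __name__ == "__main__":
    import ray

    ds = ray.data.range(3)  # rows: {"id": 0}, {"id": 1}, {"id": 2}
    # Class UDFs run in actors; fn_constructor_args is forwarded to __init__.
    ds = ds.map(
        AddOffset,
        fn_constructor_args=(10,),
        compute=ray.data.ActorPoolStrategy(size=1),
    )
    print(ds.take_all())  # ids shifted by 10
```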
💫 Enhancements:
- Read data with multi-threading for `FileBasedDataSource` (#39493)
- Optimization to reduce `ArrowBlock` building time for blocks of size 1 (#38988)
- Add `partition_filter` parameter to `read_parquet` (#38479)
- Apply limit to `Dataset.take()` and related methods (#38677)
- Postpone `reader.get_read_tasks` until execution (#38373)
- Lazily construct metadata providers (#38198)
- Support writing each block to a separate file (#37986)
- Make `iter_batches` an Iterable (#37881)
- Remove default limit on `Dataset.to_pandas()` (#37420)
- Add `Dataset.to_dask()` parameter to toggle consistent metadata check (#37163)
- Add `Datasource.on_write_start` (#38298)
- Remove support for `DatasetDict` as input into `from_huggingface()` (#37555)
🔨 Fixes:
- Backwards compatibility for `Preprocessor`s that have been fit in older versions (#39488)
- Do not eagerly free root `RefBundle`s (#39085)
- Retry opening files with exponential backoff (#38773)
- Avoid passing `local_uri` to all non-Parquet data sources (#38719)
- Add `ctx` parameter to `Datasource.write` (#38688)
- Preserve block format on `map_batches` over empty blocks (#38161)
- Fix args and kwargs passed to `ActorPool` `map_batches` (#38110)
- Add `tif` file extension to `ImageDatasource` (#38129)
- Raise error if PIL can't load an image (#38030)
- Allow automatic handling of string features as byte features during TFRecord serialization (#37995)
- Remove unnecessary file system wrapping (#38299)
- Remove `_block_udf` from `FileBasedDatasource` reads (#38111)
Ray Train
🤝 API Changes
- Ray Train and Ray Tune Checkpoints: Introduced a new `train.Checkpoint` class that unifies interaction with remote storage such as S3, GS, and HDFS. The changes follow the proposal in [REP35] Consolidated persistence API for Ray Train/Tune (#38452, #38481, #38581, #38626, #38864, #38844)
- Ray Train with PyTorch Lightning: Moving away from the LightningTrainer in favor of the TorchTrainer as the recommended way of running distributed PyTorch Lightning. The changes follow the proposal outlined in [REP37] [Train] Unify Torch based Trainers on the TorchTrainer API (#37989)
- Ray Train with Hugging Face Transformers/Accelerate: Moving away from the TransformersTrainer/AccelerateTrainer in favor of the TorchTrainer as the recommended way of running distributed Hugging Face Transformers and Accelerate. The changes follow the proposal outlined in [REP37] [Train] Unify Torch based Trainers on the TorchTrainer API (#38083, #38295)
- Deprecated the `preprocessor` arg to `Trainer` (#38640)
- Removed the deprecated `Result.log_dir` (#38794)
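The new `train.Checkpoint` round-trip can be sketched as follows, assuming a local directory; remote storage (S3/GS/HDFS) works the same way via URIs. The helper and file names are illustrative:

```python
# Sketch of the new train.Checkpoint API (Ray 2.7+): wrap a local directory
# and read it back. The helper and file names are illustrative.
import os
import tempfile

def make_checkpoint_dir() -> str:
    """Create a throwaway directory standing in for saved model state."""
    ckpt_dir = tempfile.mkdtemp()
    with open(os.path.join(ckpt_dir, "model.txt"), "w") as f:
        f.write("weights")
    return ckpt_dir

if __name__ == "__main__":
    from ray.train import Checkpoint

    # Wrap the directory in a Checkpoint, then materialize it locally again.
    ckpt = Checkpoint.from_directory(make_checkpoint_dir())
    with ckpt.as_directory() as local_dir:
        print(os.listdir(local_dir))  # contains model.txt
```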
💫 Enhancements:
- Various improvements and fixes for the console output of Ray Train and Tune (#37572, #37571, #37570, #37569, #37531, #36993)
- Raise actionable error message for missing dependencies (#38497)
- Use posix paths throughout library code (#38319)
- Group consecutive workers by IP (#38490)
- Split all Ray Datasets by default (#38694)
- Add static Trainer methods for getting tree-based models (#38344)
- Don't set rank-specific local directories for Train workers (#38007)
🔨 Fixes:
- Fix trainer restoration from S3 (#38251)
🏗 Architecture refactoring:
- Updated internal usage of the new Checkpoint API (#38853, #38804, #38697, #38695, #38757, #38648, #38598, #38617, #38554, #38586, #38523, #38456, #38507, #38491, #38382, #38355, #38284, #38128, #38143, #38227, #38141, #38057, #38104, #37888, #37991, #37962, #37925, #37906, #37690, #37543, #37475, #37142, #38855, #38807, #38818, #39515, #39468, #39368, #39195, #39105, #38563, #38770, #38759, #38767, #38715, #38709, #38478, #38550, #37909, #37613, #38876, #38868, #38736, #38871, #38820, #38457)
📖Documentation:
- Restructured the Ray Train documentation to make it easier to find relevant content (#37892, #38287, #38417, #38359)
- Improved examples, references, and navigation items (#38049, #38084, #38108, #37921, #38391, #38519, #38542, #38541, #38513, #39510, #37588, #37295, #38600, #38582, #38276, #38686, #38537, #38237, #37016)
- Removed outdated examples (#38682, #38696, #38656, #38374, #38377, #38441, #37673, #37657, #37067)
Ray Tune
🤝 API Changes
- Ray Train and Ray Tune Checkpoints: Introduced a new `train.Checkpoint` class that unifies interaction with remote storage such as S3, GS, and HDFS. The changes follow the proposal in [REP35] Consolidated persistence API for Ray Train/Tune (#38452, #38481, #38581, #38626, #38864, #38844)
- Removed the deprecated `Result.log_dir` (#38794)
💫 Enhancements:
- Various improvements and fixes for the console output of Ray Train and Tune (#37572, #37571, #37570, #37569, #37531, #36993)
- Raise actionable error message for missing dependencies (#38497)
- Use posix paths throughout library code (#38319)
- Improved the PyTorchLightning integration (#38883, #37989, #37387, #37400)
- Improved the XGBoost/LightGBM integrations (#38558, #38828)
🔨 Fixes:
- Fix hyperband r calculation and stopping (#39157)
- Replace deprecated np.bool8 (#38495)
- Miscellaneous refactors and fixes (#38165, #37506, #37181, #37173)
🏗 Architecture refactoring:
- Updated internal usages of the new Checkpoint API (#38853, #38804, #38697, #38695, #38757, #38648, #38598, #38617, #38554, #38586, #38523, #38456, #38507, #38491, #38382, #38355, #38284, #38128, #38143, #38227, #38141, #38057, #38104, #37888, #37991, #37962, #37925, #37906, #37690, #37543, #37475, #37142, #38855, #38807, #38818, #39515, #39468, #39368, #39195, #39105, #38563, #38770, #38759, #38767, #38715, #38709, #38478, #38550, #37909, #37613, #38876, #38868, #38736, #38871, #38820, #38457)
- Removed legacy TrialRunner/Executor (#37927)
Ray Serve
🎉 New Features:
- Added `keep_alive_timeout_s` to the Serve config file, allowing users to configure how long the HTTP proxy keeps idle connections alive when no requests are ongoing.
- Added a gRPC proxy to serve gRPC requests through Ray Serve. It has feature parity with HTTP while offering better performance, and it replaces the previous experimental gRPC direct ingress.
- Ray 2.7 introduces a new `DeploymentHandle` API that will replace the existing `RayServeHandle` and `RayServeSyncHandle` APIs in a future release. You are encoura...