Releases: ray-project/ray
Ray-2.0.1
The Ray 2.0.1 patch release contains dependency upgrades and fixes for multiple components:
- Upgrade grpcio version to 1.32 (#28025)
- Upgrade redis version to 7.0.5 (#28936)
- Fix segfault when using runtime environments (#28409)
- Increase RPC timeout for dashboard (#28330)
- Set correct path when using python -m (#28140)
- [Autoscaler] Fix autoscaling for 0 CPU head node (#26813)
- [Serve] Allow code in private remote Git URIs to be imported (#28250)
- [Serve] Allow host and port in Serve config (#27026)
- [RLlib] Evaluation supports asynchronous rollout (a single slow eval worker will not block the overall evaluation progress). (#27390)
- [Tune] Fix hang during checkpoint synchronization (#28155)
- [Tune] Fix trial restoration from different IP (#28470)
- [Tune] Fix custom synchronizer serialization (#28699)
- [Workflows] Replace deprecated name option with task_id (#28151)
Ray-2.0.0
Release Highlights
Ray 2.0 is an exciting release with enhancements to all libraries in the Ray ecosystem. With this major release, we take strides towards our goal of making distributed computing scalable, unified, and open.
Towards these goals, Ray 2.0 features new capabilities for unifying the machine learning (ML) ecosystem, improving Ray's production support, and making it easier than ever for ML practitioners to use Ray's libraries.
Highlights:
- Ray AIR, a scalable and unified toolkit for ML applications, is now in Beta.
- Ray now supports natively shuffling 100TB or more of data with the Ray Datasets library.
- KubeRay, a toolkit for running Ray on Kubernetes, is now in Beta. This replaces the legacy Python-based Ray operator.
- Ray Serve’s Deployment Graph API is a new and easier way to build, test, and deploy an inference graph of deployments. This is released as Beta in 2.0.
A migration guide for all the different libraries can be found here: Ray 2.0 Migration Guide.
Ray Libraries
Ray AIR
Ray AIR is now in beta. Ray AIR builds upon Ray’s libraries to enable end-to-end machine learning workflows and applications on Ray. You can install all dependencies needed for Ray AIR via pip install -U "ray[air]".
🎉 New Features:
- Predictors:
- BatchPredictors now have support for scalable inference on GPUs.
- All Predictors can now be constructed from pre-trained models, allowing you to easily scale batch inference with trained models from common ML frameworks.
- ray.ml.predictors has been moved to the Ray Train namespace (ray.train).
- Preprocessing: New preprocessors and API changes on Ray Datasets now make feature processing easier to do on AIR. See the Ray Data release notes for more details.
- New features for Datasets/Train/Tune/Serve are described in the corresponding library release notes.
💫 Enhancements:
- Major package refactoring is included in this release.
- ray.ml is renamed to ray.air.
- ray.ml.preprocessors have been moved to ray.data.
- train_test_split is now a new method of ray.data.Dataset (#27065)
- ray.ml.trainers have been moved to ray.train (#25570)
- ray.ml.predictors has been moved to ray.train.
- ray.ml.config has been moved to ray.air.config (#25712).
- Checkpoints are now framework-specific -- meaning that each Trainer generates its own Framework-specific Checkpoint class. See Ray Train for more details.
- ModelWrappers have been renamed to PredictorDeployments.
- API stability annotations have been added (#25485)
- Train/Tune now have the same reporting and checkpointing API -- see the Train notes for more details (#26303)
- ScalingConfigs are now Dataclasses not Dict types
- Many AIR examples, benchmarks, and documentation pages were added in this release. The Ray AIR documentation will cover breadth of usage (end to end workflows across different libraries) while library-specific documentation will cover depth (specific features of a specific library).
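One practical effect of the change from Dict-based ScalingConfigs to dataclasses, noted above, is that a misspelled option fails at construction time instead of being silently ignored. The sketch below illustrates this with a stand-in dataclass; ScalingConfigSketch and its fields are hypothetical and are not Ray's actual class.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ScalingConfigSketch:
    """Hypothetical stand-in for a scaling config; not Ray's ScalingConfig."""
    num_workers: int = 1
    use_gpu: bool = False
    resources_per_worker: Dict[str, float] = field(default_factory=dict)


# A plain dict accepts a misspelled key ("use_gpus") without complaint.
dict_config = {"num_workers": 2, "use_gpus": True}

# The dataclass constructor rejects the same typo with a TypeError.
try:
    ScalingConfigSketch(num_workers=2, use_gpus=True)
    typo_rejected = False
except TypeError:
    typo_rejected = True

valid_config = ScalingConfigSketch(num_workers=2, use_gpu=True)
```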
🔨 Fixes:
- Many documentation examples were previously untested. This release fixes those examples and adds them to the CI.
- Predictors:
- Torch/Tensorflow Predictors have correctness fixes (#25199, #25190, #25138, #25136)
- Update KerasCallback to work with TensorflowPredictor (#26089)
- Add streaming BatchPredictor support (#25693)
- Add predict_pandas implementation (#25534)
- Add _predict_arrow interface for Predictor (#25579)
- Allow creating Predictor directly from a UDF (#26603)
- Execute GPU inference in a separate stage in BatchPredictor (#26616, #27232, #27398)
- Accessors for preprocessor in Predictor class (#26600)
- [AIR] Predictor call_model API for unsupported output types (#26845)
Ray Data Processing
🎉 New Features:
- Add ImageFolderDatasource (#24641)
- Add the NumPy batch format for batch mapping and batch consumption (#24870)
- Add iter_torch_batches() and iter_tf_batches() APIs (#26689)
- Add local shuffling API to iterators (#26094)
- Add drop_columns() API (#26200)
- Add randomize_block_order() API (#25568)
- Add random_sample() API (#24492)
- Add support for len(Dataset) (#25152)
- Add UDF passthrough args to map_batches() (#25613)
- Add Concatenator preprocessor (#26526)
- Change range_arrow() API to range_table() (#24704)
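The local shuffling item above (#26094) refers to shuffling inside a bounded buffer while iterating, rather than shuffling the whole dataset up front. The following is a minimal pure-Python sketch of that general technique, not Ray's implementation; the function name local_shuffle and its parameters are assumptions made for illustration.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def local_shuffle(items: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield items in a locally shuffled order using a bounded buffer (hypothetical helper)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Swap a random element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(local_shuffle(range(10), buffer_size=4, seed=1))
```

A larger buffer gives an order closer to a full shuffle at the cost of more memory; a buffer of size 1 leaves the order unchanged.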
💫 Enhancements:
- Autodetect dataset parallelism based on available resources and data size (#25883)
- Use polars for sorting (#25454)
- Support tensor columns in to_tf() and to_torch() (#24752)
- Add explicit resource allocation option via a top-level scheduling strategy (#24438)
- Spread actor pool actors evenly across the cluster by default (#25705)
- Add ray_remote_args to read_text() (#23764)
- Add max_epoch argument to iter_epochs() (#25263)
- Add Pandas-native groupby and sorting (#26313)
- Support push-based shuffle in groupby operations (#25910)
- More aggressive memory releasing for Dataset and DatasetPipeline (#25461, #25820, #26902, #26650)
- Automatically cast tensor columns on Pandas UDF outputs (#26924)
- Better error messages when reading from S3 (#26619, #26669, #26789)
- Make dataset splitting more efficient and stable (#26641, #26768, #26778)
- Use sampling to estimate in-memory data size for Parquet data source (#26868)
- De-experimentalized lazy execution mode (#26934)
🔨 Fixes:
- Fix pipeline pre-repeat caching (#25265)
- Fix stats construction for from_*() APIs (#25601)
- Fixes label tensor squeezing in to_tf() (#25553)
- Fix stage fusion between equivalent resource args (fixes BatchPredictor) (#25706)
- Fix tensor extension string formatting (repr) (#25768)
- Workaround for unserializable Arrow JSON ReadOptions (#25821)
- Make ActorPoolStrategy kill pool of actors if exception is raised (#25803)
- Fix max number of actors for default actor pool strategy (#26266)
- Fix byte size calculation for non-trivial tensors (#25264)
Ray Train
Ray Train has received a major expansion of scope with Ray 2.0.
In particular, the Ray Train module now contains:
- Trainers
- Predictors
- Checkpoints
for common ML frameworks including PyTorch, TensorFlow, XGBoost, LightGBM, HuggingFace, and Scikit-Learn. These APIs help provide end-to-end usage of Ray libraries in Ray AIR workflows.
🎉 New Features:
- The Trainer API is now deprecated in favor of the new Ray AIR Trainers API. Trainers for PyTorch, TensorFlow, Horovod, XGBoost, and LightGBM are now in Beta. (#25570)
- ML framework-specific Predictors have been moved into the ray.train namespace. This provides a streamlined API for offline and online inference of PyTorch, TensorFlow, XGBoost models and more. (#25769, #26215, #26251, #26451, #26531, #26600, #26603, #26616, #26845)
- ML framework-specific checkpoints are introduced. Checkpoints are consumed by Predictors to load model weights and information. (#26777, #25940, #26532, #26534)
💫 Enhancements:
- Train and Tune now use the same reporting and checkpointing API (#24772, #25558)
- Add tunable ScalingConfig dataclass (#25712)
- Randomize block order by default to avoid hotspots (#25870)
- Improve checkpoint configurability and extend results (#25943)
- Improve prepare_data_loader to support multiple batch data types (#26386)
- Discard returns of train loops in Trainers (#26448)
- Clean up logs, reprs, warnings (#26259, #26906, #26988, #27228, #27519)
📖 Documentation:
- Update documentation to use new Train API (#25735)
- Update documentation to use session API (#26051, #26303)
- Add Trainer user guide and update Trainer docs (#27570, #27644, #27685)
- Add Predictor documentation (#25833)
- Replace to_torch with iter_torch_batches (#27656)
- Replace to_tf with iter_tf_batches (#27768)
- Minor doc fixes (#25773, #27955)
🏗 Architecture refactoring:
🔨 Fixes:
- An issue with GPU ID detection and assignment was fixed. (#26493)
- Fix AMP for models with a custom __getstate__ method (#25335)
- Fix transformers example for multi-gpu (#24832)
- Fix ScalingConfig key validation (#25549)
- Fix ResourceChangingScheduler integration (#26307)
- Fix auto_transfer cuda device (#26819)
- Fix BatchPredictor.predict_pipelined not working with GPU stage (#27398)
- Remove rllib dependency from tensorflow_predictor (#27688)
Ray Tune
🎉 New Features:
- The Tuner API is the new way of running Ray Tune experiments. (#26987, #26961, #26931, #26884, #26930)
- Ray Tune and Ray Train now have the same API for reporting (#25558)
- Introduce tune.with_resources() to specify function trainable resources (#26830)
- Add Tune benchmark for AIR (#26763, #26564)
- Allow Tuner().restore() from cloud URIs (#26963)
- Add top-level imports for Tuner, TuneConfig, move CheckpointConfig (#26882)
- Add resume experiment options to Tuner.restore() (#26826)
- Add checkpoint_frequency/checkpoint_at_end arguments to CheckpointConfig (#26661)
- Add more config arguments to Tuner (#26656)
- Better error message for Tune nested tasks / actors (#25241)
- Allow iterators in tune.grid_search (#25220)
- Add get_dataframe() method to result grid, fix config flattening (#24686)
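The grid search item above (#25220) concerns accepting arbitrary iterators as the list of values to try. As a rough illustration of what a grid search expands to, the sketch below builds the Cartesian product of several value iterables in plain Python. It does not use Ray Tune; expand_grid is a hypothetical helper, not part of the Tune API.

```python
import itertools
from typing import Any, Dict, Iterable, List


def expand_grid(space: Dict[str, Iterable[Any]]) -> List[Dict[str, Any]]:
    """Return one config dict per combination of values (hypothetical helper)."""
    keys = list(space)
    # Materialize each iterable once so that generators and ranges work too.
    value_lists = [list(space[key]) for key in keys]
    return [dict(zip(keys, combo)) for combo in itertools.product(*value_lists)]


# Both an iterator and a range are accepted as value sources.
configs = expand_grid({"lr": iter([0.1, 0.01]), "layers": range(1, 4)})
```

Here two learning rates and three layer counts produce six trial configurations.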
💫 Enhancements:
- Expose number of errored/terminated trials in ResultGrid (#26655)
- remove f...
Ray-1.13.0
Highlights:
- Python 3.10 support is now in alpha.
- Ray usage stats collection is now on by default (guarded by an opt-out prompt).
- Ray Tune can now synchronize Trial data from worker nodes via the object store (without rsync!)
- Ray Workflow comes with a new API and is integrated with Ray DAG.
Ray Autoscaler
💫Enhancements:
- CI tests for KubeRay autoscaler integration (#23365, #23383, #24195)
- Stability enhancements for KubeRay autoscaler integration (#23428)
🔨 Fixes:
- Improved GPU support in KubeRay autoscaler integration (#23383)
- Resources scheduled with the node affinity strategy are not reported to the autoscaler (#24250)
Ray Client
💫Enhancements:
- Add option to configure ray.get with >2 sec timeout (#22165)
- Return None from internal KV for non-existent keys (#24058)
🔨 Fixes:
- Fix deadlock by switching to SimpleQueue on Python 3.7 and newer in async dataclient (#23995)
Ray Core
🎉 New Features:
- Ray usage stats collection is now on by default (guarded by an opt-out prompt)
- Alpha support for python 3.10 (on Linux and Mac)
- Node affinity scheduling strategy (#23381)
- Add metrics for disk and network I/O (#23546)
- Improve exponential backoff when connecting to Redis (#24150)
- Add the ability to inject a setup hook for customization of runtime_env on init (#24036)
- Add a utility to check GCS / Ray cluster health (#23382)
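For readers unfamiliar with the exponential backoff mentioned in the Redis connection item above (#24150), the sketch below shows the general pattern: each retry waits longer than the last, up to a cap. This is a generic illustration, not Ray's connection code; the function name and all parameter values are assumptions.

```python
from typing import List


def backoff_delays(base: float, factor: float, max_delay: float, attempts: int) -> List[float]:
    """Compute capped exponential retry delays in seconds (hypothetical helper)."""
    delays = []
    delay = base
    for _ in range(attempts):
        # Never wait longer than max_delay, however many retries have happened.
        delays.append(min(delay, max_delay))
        delay *= factor
    return delays


delays = backoff_delays(base=1.0, factor=2.0, max_delay=10.0, attempts=6)
# → [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

In practice a small random jitter is often added to each delay so that many clients do not retry at the same instant.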
🔨 Fixes:
- Fixed internal storage S3 bugs (#24167)
- Ensure "get_if_exists" takes effect in the decorator. (#24287)
- Reduce memory usage for Pubsub channels that do not require total memory cap (#23985)
- Add memory buffer limit in publisher for each subscribed entity (#23707)
- Use gRPC instead of socket for GCS client health check (#23939)
- Trim size of Reference struct (#23853)
- Enable debugging into pickle backend (#23854)
🏗 Architecture refactoring:
- Gcs storage interfaces unification (#24211)
- Cleanup pickle5 version check (#23885)
- Simplify options handling (#23882)
- Moved function and actor importer away from pubsub (#24132)
- Replace the legacy ResourceSet & SchedulingResources at Raylet (#23173)
- Unification of AddSpilledUrl and UpdateObjectLocationBatch RPCs (#23872)
- Save task spec in separate table (#22650)
Ray Datasets
🎉 New Features:
- Performance improvement: the aggregation computation is vectorized (#23478)
- Performance improvement: bulk parquet file reading is optimized with the fast metadata provider (#23179)
- Performance improvement: more efficient move semantics for Datasets block processing (#24127)
- Supports Datasets lineage serialization (aka out-of-band serialization) (#23821, #23931, #23932)
- Supports native Tensor views in map processing for pure-tensor datasets (#24812)
- Implemented push-based shuffle (#24281)
🔨 Fixes:
- Documentation improvement: Getting Started page (#24860)
- Documentation improvement: FAQ (#24932)
- Documentation improvement: End to end examples (#24874)
- Documentation improvement: Feature guide - Creating Datasets (#24831)
- Documentation improvement: Feature guide - Saving Datasets (#24987)
- Documentation improvement: Feature guide - Transforming Datasets (#25033)
- Documentation improvement: Datasets APIs docstrings (#24949)
- Performance: fixed block prefetching (#23952)
- Fixed zip() for Pandas dataset (#23532)
🏗 Architecture refactoring:
- Refactored LazyBlockList (#23624)
- Added path-partitioning support for all content types (#23624)
- Added fast metadata provider and refactored Parquet datasource (#24094)
RLlib
🎉 New Features:
- Replay buffer API: First algorithms are using the new replay buffer API, allowing users to define and configure their own custom buffers or use RLlib’s built-in ones: SimpleQ, DQN (#24164, #22842, #23523, #23586)
🏗 Architecture refactoring:
- More algorithms moved into the training iteration function API (no longer using execution plans). Users can now more easily read, develop, and debug RLlib’s algorithms: A2C, APEX-DQN, CQL, DD-PPO, DQN, MARWIL + BC, PPO, QMIX , SAC, SimpleQ, SlateQ, Trainers defined in examples folder. (#22937, #23420, #23673, #24164, #24151, #23735, #24157, #23798, #23906, #24118, #22842, #24166, #23712). This will be fully completed and documented with Ray 2.0.
- Make RolloutWorkers (optionally) recoverable after failure via the new recreate_failed_workers=True config flag. (#23739)
- POC for new TrainerConfig objects (instead of python config dicts): PPOConfig (for PPOTrainer) and PGConfig (for PGTrainer). (#24295, #23491)
- Hard-deprecate build_trainer() (trainer_templates.py): All custom Trainers should now sub-class from any existing Trainer class. (#23488)
💫Enhancements:
- Add support for complex observations in CQL. (#23332)
- Bandit support for tf2. (#22838)
- Make actions sent by RLlib to the env immutable. (#24262)
- Memory leak finding toolset using tracemalloc + CI memory leak tests. (#15412)
- Enable DD-PPO to run on Windows. (#23673)
🔨 Fixes:
- APPO eager fix (APPOTFPolicy gets wrapped as_eager() twice by mistake). (#24268)
- CQL gets stuck when deprecated timesteps_per_iteration is used (use min_train_timesteps_per_reporting instead). (#24345)
- SlateQ runs on GPU (torch). (#23464)
- Other bug fixes: #24016, #22050, #23814, #24025, #23740, #23741, #24006, #24005, #24273, #22010, #24271, #23690, #24343, #23419, #23830, #24335, #24148, #21735, #24214, #23818, #24429
Ray Workflow
🎉 New Features:
🔨 Fixes:
- Fix one bug where max_retries is not aligned with ray core’s max_retries. (#22903)
🏗 Architecture refactoring:
- Integrate ray storage in workflow (#24120)
Tune
🎉 New Features:
- Add RemoteTask based sync client (#23605) (rsync not required anymore!)
- Chunk file transfers in cross-node checkpoint syncing (#23804)
- Also interrupt training when SIGUSR1 received (#24015)
- reuse_actors per default for function trainables (#24040)
- Enable AsyncHyperband to continue training for last trials after max_t (#24222)
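The chunked file transfer item above (#23804) refers to sending checkpoint data in pieces rather than as one large payload. A minimal sketch of that idea with the standard library follows; it is not Tune's sync client, and iter_chunks and the 1024-byte chunk size are assumptions for illustration.

```python
import io
from typing import BinaryIO, Iterator, List


def iter_chunks(stream: BinaryIO, chunk_size: int) -> Iterator[bytes]:
    """Yield successive chunks of at most chunk_size bytes (hypothetical helper)."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Stand-in for a packed checkpoint directory.
payload = b"x" * 2500
chunks: List[bytes] = list(iter_chunks(io.BytesIO(payload), chunk_size=1024))
chunk_sizes = [len(chunk) for chunk in chunks]
# → [1024, 1024, 452]
```

The receiver can write each chunk as it arrives, so neither side needs to hold the whole checkpoint in memory.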
💫Enhancements:
- Improve testing (#23229)
- Improve docstrings (#23375)
- Improve documentation (#23477, #23924)
- Simplify trial executor logic (#23396)
- Make MLflowLoggerUtil copyable (#23333)
- Use new Checkpoint interface internally (#22801)
- Beautify Optional typehints (#23692)
- Improve missing search dependency info (#23691)
- Skip tmp checkpoints in analysis and read iteration from metadata (#23859)
- Treat checkpoints with nan value as worst (#23862)
- Clean up base ProgressReporter API (#24010)
- De-clutter log outputs in trial runner (#24257)
- hyperopt searcher to support tune.choice([[1,2],[3,4]]). (#24181)
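The item above about treating checkpoints with a nan value as worst (#23862) matters because every comparison against NaN is False in Python, so a naive max() over scores can return a NaN-scored checkpoint depending on ordering. The sketch below shows one way to rank NaN last; it is an illustration only, and best_checkpoint is a hypothetical helper rather than Tune's code.

```python
import math
from typing import Dict


def best_checkpoint(scores: Dict[str, float], mode: str = "max") -> str:
    """Return the name of the best-scoring checkpoint, ranking NaN as worst."""
    def sort_key(item):
        _, score = item
        if math.isnan(score):
            # NaN always loses, regardless of mode.
            return -math.inf
        return score if mode == "max" else -score

    return max(scores.items(), key=sort_key)[0]


scores = {"ckpt_1": float("nan"), "ckpt_2": 0.5, "ckpt_3": 0.9}
best_max = best_checkpoint(scores, mode="max")  # → "ckpt_3"
best_min = best_checkpoint(scores, mode="min")  # → "ckpt_2"
```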
🔨Fixes:
- Optuna should ignore additional results after trial termination (#23495)
- Fix PTL multi GPU link (#23589)
- Improve Tune cloud release tests for durable storage (#23277)
- Fix tensorflow distributed trainable docstring (#23590)
- Simplify experiment tag formatting, clean directory names (#23672)
- Don't include nan metrics for best checkpoint (#23820)
- Fix syncing between nodes in placement groups (#23864)
- Fix memory resources for head bundle (#23861)
- Fix empty CSV headers on trial restart (#23860)
- Fix checkpoint sorting with nan values (#23909)
- Make Timeout stopper work after restoring in the future (#24217)
- Small fixes to tune-distributed for new restore modes (#24220)
Train
Most distributed training enhancements will be captured in the new Ray AIR category!
🔨Fixes:
- Copy resources_per_worker to avoid modifying user input
- Fix train.torch.get_device() for fractional GPU or multiple GPU per worker case (#23763)
- Fix multi node horovod bug (#22564)
- Fully deprecate Ray SGD v1 (#24038)
- Improvements to fault tolerance (#22511)
- MLflow start run under correct experiment (#23662)
- Raise helpful error when required backend isn't installed (#23583)
- Warn pending deprecation for ray.train.Trainer and ray.tune DistributedTrainableCreators (#24056)
📖Documentation:
- add FAQ (#22757)
Ray AIR
🎉 New Features:
- HuggingFaceTrainer & HuggingFacePredictor (#23615, #23876)
- SklearnTrainer & SklearnPredictor (#23803, #23850)
- HorovodTrainer (#23437)
- RLTrainer & RLPredictor (#23465, #24172)
- BatchMapper preprocessor (#23700)
- Categorizer preprocessor (#24180)
- BatchPredictor (#23808)
💫Enhancements:
- Add Checkpoint.as_directory() for efficient checkpoint fs processing (#23908)
- Add config to Result, extend ResultGrid.get_best_config (#23698)
- Add Scaling Config validation (#23889)
- Add tuner test. (#23364)
- Move storage handling to pyarrow.fs.FileSystem (#23370)
- Refactor _get_unique_value_indices (#24144)
- Refactor most_frequent SimpleImputer (#23706)
- Set name of Trainable to match with Trainer (#23697)
- Use checkpoint.as_directory() instead of cleaning up manually (#24113)
- Improve file packing/unpacking (#23621)
- Make Dataset ingest configurable (#24066)
- Remove postprocess_checkpoint (#24297)
🔨Fixes:
- Better exception handling (#23695)
- Do not deepcopy RunConfig (#23499)
- reduce unnecessary stacktrace (#23475)
- Tuner should use run_config from Trainer per default (#24079)
- Use custom fsspec handler for GS (#24008)
📖Documentation:
Serve
🎉 New Features:
- Serve logging system was revamped! Access log is now turned on by default. (#23558)
- New Gradio notebook example for Ray Serve deployments (#23494)
- Serve now includes full traceback in deployment update error message (#23752)
💫Enhancements:
- Serve Deployment Graph was...
Ray-1.12.1
Patch release with the following fixes:
- Ray now works on Google Colab again! The bug with memory limit fetching when running Ray in a container is now fixed (#23922).
- ray-ml Docker images for CPU will start being built again after they were stopped in Ray 1.9 (#24266).
- [Train/Tune] Start MLflow run under the correct experiment for Ray Train and Ray Tune integrations (#23662).
- [RLlib] Fix for APPO in eager mode (#24268).
- [RLlib] Fix Alphastar for TF2 and tracing enabled (c5502b2).
- [Serve] Fix replica leak in anonymous namespaces (#24311).
Ray-1.11.1
Patch release including fixes for the following issues:
Ray-1.12.0
Highlights
- Ray AI Runtime (AIR), an open-source toolkit for building end-to-end ML applications on Ray, is now in Alpha. AIR is an effort to unify the experience of using different Ray libraries (Ray Data, Train, Tune, Serve, RLlib). You can find more information on the docs or on the public RFC.
- Getting involved with Ray AIR. We’ll be holding office hours, development sprints, and other activities as we get closer to the Ray AIR Beta/GA release. Want to join us? Fill out this short form!
- Ray usage data collection is now off by default. If you have any questions or concerns, please comment on the RFC.
- New algorithms are added to RLlib: SlateQ & Bandits (for recommender systems use cases) and AlphaStar (multi-agent, multi-GPU w/ league-based self-play)
- Ray Datasets: new lazy execution model with automatic task fusion and memory-optimizing move semantics; first-class support for Pandas DataFrame blocks; efficient random access datasets.
Ray Autoscaler
🎉 New Features
💫 Enhancements
- Improved documentation and standards around built-in autoscaler node providers. (#22236, #22237)
- Improved KubeRay support (#22987, #22847, #22348, #22188)
- Remove redis requirement (#22083)
🔨 Fixes
- No longer print infeasible warnings for internal placement group resources. Placement groups which cannot be satisfied by the autoscaler still trigger warnings. (#22235)
- Default AMIs per AWS region are updated/fixed. (#22506)
- GCP node termination updated (#23101)
- Retry legacy k8s operator on monitor failure (#22792)
- Cap min and max workers for manually managed on-prem clusters (#21710)
- Fix initialization artifacts (#22570)
- Ensure initial scaleup with high upscaling_speed isn't limited. (#21953)
Ray Client
🎉 New Features:
- ray.init has consistent return value in client mode and driver mode #21355
💫Enhancements:
🔨 Fixes:
- Fix ray client object ref releasing in wrong context #22025
Ray Core
🎉 New Features
- RuntimeEnv:
- Support setting timeout for runtime_env setup. (#23082)
- Support setting pip_check and pip_version for runtime_env. (#22826, #23306)
- env_vars will take effect when the pip install command is executed. (temporarily ineffective in conda) (#22730)
- Support strongly-typed API ray.runtime.RuntimeEnv to define runtime env. (#22522)
- Introduce virtualenv to isolate the pip type runtime env. (#21801,#22309)
- Raylet shares fate with the dashboard agent. And the dashboard agent will stay alive when it catches the port conflicts. (#22382,#23024)
- Enable dashboard in the minimal ray installation (#21896)
- Add task and object reconstruction status to ray memory cli tools (#22317)
🔨 Fixes
- Report only memory usage of pinned object copies to improve scaledown. (#22020)
- Scheduler:
- Object store:
- Improve ray stop behavior (#22159)
- Avoid warning when receiving too many logs from a different job (#22102)
- Gcs resource manager bug fix and clean up. (#22462, #22459)
- Release GIL when running parallel_memcopy() / memcpy() during serializations. (#22492)
- Fix registering serializer before initializing Ray. (#23031)
🏗 Architecture refactoring
- Ray distributed scheduler refactoring: (#21927, #21992, #22160, #22359, #22722, #22817, #22880, #22893, #22885, #22597, #22857, #23124)
- Removed support for bootstrapping with Redis.
Ray Data Processing
🎉 New Features
- Big Performance and Stability Improvements:
- Add lazy execution mode with automatic stage fusion and optimized memory reclamation via block move semantics (#22233, #22374, #22373, #22476)
- Support for random access datasets, providing efficient random access to rows via binary search (#22749)
- Add automatic round-robin load balancing for reading and shuffle reduce tasks, obviating the need for the _spread_resource_prefix hack (#21303)
- More Efficient Tabular Data Wrangling:
- Groupby + Aggregations Improvements:
- Improved Dataset Windowing:
- Better Text I/O:
- New Operations:
- Add add_column() utility for adding derived columns (#21967)
- Support for metadata provider callback for read APIs (#22896)
- Support configuring autoscaling actor pool size (#22574)
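The random access datasets item above (#22749) describes looking up rows by key via binary search. The sketch below shows the general two-level idea on data sorted by key: first find the block whose key range could contain the key, then search inside that block. It is a simplified single-process illustration, not Ray's implementation; SortedBlockIndex is a hypothetical class.

```python
import bisect
from typing import Any, List, Optional, Tuple


class SortedBlockIndex:
    """Hypothetical lookup structure over blocks of (key, value) rows sorted by key."""

    def __init__(self, blocks: List[List[Tuple[int, Any]]]):
        # Drop empty blocks so every block has a well-defined upper bound.
        self.blocks = [block for block in blocks if block]
        self.block_keys = [[key for key, _ in block] for block in self.blocks]
        self.upper_bounds = [keys[-1] for keys in self.block_keys]

    def get(self, key: int) -> Optional[Any]:
        # First binary search: which block could hold this key?
        i = bisect.bisect_left(self.upper_bounds, key)
        if i == len(self.blocks):
            return None
        # Second binary search: where is the key inside that block?
        keys = self.block_keys[i]
        j = bisect.bisect_left(keys, key)
        if j < len(keys) and keys[j] == key:
            return self.blocks[i][j][1]
        return None


index = SortedBlockIndex([[(1, "a"), (3, "b")], [], [(5, "c"), (8, "d")]])
found = index.get(5)    # → "c"
missing = index.get(4)  # → None
```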
🔨 Fixes
- Force lazy datasource materialization in order to respect DatasetPipeline stage boundaries (#21970)
- Simplify lifetime of designated block owner actor, and don’t create it if dynamic block splitting is disabled (#22007)
- Respect 0 CPU resource request when using manual resource-based load balancing (#22017)
- Remove batch format ambiguity by always converting Arrow batches to Pandas when batch_format="native" is given (#21566)
- Fix leaked stats actor handle due to closure capture reference counting bug (#22156)
- Fix boolean tensor column representation and slicing (#22323)
- Fix unhandled empty block edge case in shuffle (#22367)
- Fix unserializable Arrow Partitioning spec (#22477)
- Fix incorrect iter_epochs() batch format (#22550)
- Fix infinite iter_epochs() loop on unconsumed epochs (#22572)
- Fix infinite hang on split() when num_shards < num_rows (#22559)
- Patch Parquet file fragment serialization to prevent metadata fetching (#22665)
- Don’t reuse task workers for actors or GPU tasks (#22482)
- Pin pipeline executor actors to driver node to allow for lineage-based fault tolerance for pipelines (#22715)
- Always use non-empty blocks to determine schema (#22834)
- API fix bash (#22886)
- Make label_column optional for to_tf() so it can be used for inference (#22916)
- Fix schema() for DatasetPipelines (#23032)
- Fix equalized split when num_splits == num_blocks (#23191)
💫 Enhancements
- Optimize Parquet metadata serialization via batching (#21963)
- Optimize metadata read/write for Ray Client (#21939)
- Add sanity checks for memory utilization (#22642)
🏗 Architecture refactoring
- Use threadpool to submit DatasetPipeline stages (#22912)
RLlib
🎉 New Features
- New “AlphaStar” algorithm: A parallelized, multi-agent/multi-GPU learning algorithm, implementing league-based self-play. (#21356, #21649)
- SlateQ algorithm has been re-tested, upgraded (multi-GPU capable, TensorFlow version), and bug-fixed (added to weekly learning tests). (#22389, #23276, #22544, #22543, #23168, #21827, #22738)
- Bandit algorithms: Moved into agents folder as first-class citizens, TensorFlow-Version, unified w/ other agents’ APIs. (#22821, #22028, #22427, #22465, #21949, #21773, #21932, #22421)
- ReplayBuffer API (in progress): Allow users to customize and configure their own replay buffers and use these inside custom or built-in algorithms. (#22114, #22390, #21808)
- Datasets support for RLlib: Dataset Reader/Writer and documentation. (#21808, #22239, #21948)
🔨 Fixes
- Fixed memory leak in SimpleReplayBuffer. (#22678)
- Fixed Unity3D built-in examples: Action bounds from -inf/inf to -1.0/1.0. (#22247)
- Various bug fixes. (#22350, #22245, #22171, #21697, #21855, #22076, #22590, #22587, #22657, #22428, #23063, #22619, #22731, #22534, #22074, #22078, #22641, #22684, #22398, #21685)
🏗 Architecture refactoring
- A3C: Moved into new training_iteration API (from execution_plan API). Led to a ~2.7x performance increase on an Atari + CNN + LSTM benchmark. (#22126, #22316)
- Make multiagent->policies_to_train more flexible via callable option (alternative to providing a list of policy IDs). (#20735)
💫Enhancements:
- Env pre-checking module now active by default. (#22191)
- Callbacks: Added on_sub_environment_created and on_trainer_init callback options. (#21893, #22493)
- RecSim environment wrappers: Ability to use Google’s RecSim for recommender systems more easily w/ RLlib algorithms (3 RLlib-ready example environments). (#22028, #21773, #22211)
- MARWIL loss function enhancement (exploratory term for stddev). (#21493)
📖Documentation:
- Docs enhancements: Setup-dev instructions; Ray datasets integration. (#22239)
- Other doc enhancements and fixes. (#23160, #23226, #22496, #22489, #22380)
Ray Workflow
🎉 New Features:
- Support skip checkpointing.
🔨 Fixes:
- Fix an issue where the event loop is not set.
Tune
🎉 New Features:
- Expose new checkpoint interface to users (#22741)
💫Enhancemen...
Ray-1.11.0
Highlights
🎉 Ray no longer starts Redis by default. Cluster metadata previously stored in Redis is stored in the GCS now.
Ray Autoscaler
🎉 New Features
- AWS Cloudwatch dashboard support #20266
💫 Enhancements
- Kuberay autoscaler prototype #21086
🔨 Fixes
- Ray.autoscaler.sdk import issue #21795
Ray Core
🎉 New Features
🔨 Fixes
- Better support for nested tasks
- Fixed 16GB mac perf issue by limiting the plasma store size to 2GB #21224
- Fix SchedulingClassInfo.running_tasks memory leak #21535
- Round robin during spread scheduling #19968
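The round-robin spread scheduling item above (#19968) is about distributing tasks across nodes in turn instead of piling them onto one node. The sketch below shows plain round-robin assignment in Python as a general illustration; it is not the Ray scheduler, and the function and node names are hypothetical.

```python
import itertools
from typing import Dict, List


def spread_round_robin(tasks: List[str], nodes: List[str]) -> Dict[str, List[str]]:
    """Assign tasks to nodes in rotating order (hypothetical helper)."""
    assignment: Dict[str, List[str]] = {node: [] for node in nodes}
    # cycle() repeats the node list forever; zip() stops when tasks run out.
    for task, node in zip(tasks, itertools.cycle(nodes)):
        assignment[node].append(task)
    return assignment


placement = spread_round_robin(["t0", "t1", "t2", "t3", "t4"], ["node-a", "node-b"])
# → {"node-a": ["t0", "t2", "t4"], "node-b": ["t1", "t3"]}
```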
🏗 Architecture refactoring
- Refactor scheduler resource reporting public APIs #21732
- Refactor ObjectManager wait logic to WaitManager #21369
Ray Data Processing
🎉 New Features
- More powerful to_torch() API, providing more control over the GPU batch format. (#21117)
🔨 Fixes
- Fix simple Dataset sort generating only 1 non-empty block. (#21588)
- Improve error handling across sorting, groupbys, and aggregations. (#21610, #21627)
- Fix boolean tensor column representation and slicing. (#22358)
RLlib
🎉 New Features
- Better utils for flattening complex inputs and enable prev-actions for LSTM/attention for complex action spaces. (#21330)
- MultiAgentEnv pre-checker (#21476)
- Base env pre-checker. (#21569)
🔨 Fixes
- Better defaults for QMix (#21332)
- Fix contrib/MADDPG + pettingzoo coop-pong-v4. (#21452)
- Fix action unsquashing causing inf/NaN actions for unbounded action spaces. (#21110)
- Ignore PPO KL-loss term completely if kl-coeff == 0.0 to avoid NaN values (#21456)
- unsquash_action and clip_action (when None) cause wrong actions computed by Trainer.compute_single_action. (#21553)
- Conv2d default filter tests and add default setting for 96x96 image obs space. (#21560)
- Bring back and fix offline RL (BC & MARWIL) learning tests. (#21574, #21643)
- SimpleQ should not use a prio. replay buffer. (#21665)
- Fix video recorder env wrapper. Added test case. (#21670)
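The PPO fix listed above (#21456) skips the KL term when the coefficient is 0.0. The reason this avoids NaN is that in IEEE floating point, 0.0 multiplied by infinity is NaN, so a zero coefficient does not neutralize a non-finite KL value. The sketch below demonstrates this in plain Python; total_loss is a hypothetical function, not RLlib's loss code.

```python
import math


def total_loss(policy_loss: float, kl: float, kl_coeff: float) -> float:
    """Combine a policy loss with an optional KL penalty (hypothetical helper)."""
    if kl_coeff == 0.0:
        # Skip the KL term entirely so a non-finite KL cannot leak into the loss.
        return policy_loss
    return policy_loss + kl_coeff * kl


# Naively multiplying a zero coefficient by an infinite KL yields NaN.
naive = 1.0 + 0.0 * math.inf
guarded = total_loss(policy_loss=1.0, kl=math.inf, kl_coeff=0.0)  # → 1.0
```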
🏗 Architecture refactoring
- Decentralized multi-agent learning (#21421)
- Preparatory PR for multi-agent multi-GPU learner (alpha-star style) (#21652)
Ray Workflow
🔨 Fixes
- Fixed workflow recovery issue due to a bug of dynamic output #21571
Tune
🎉 New Features
- It is now possible to load all evaluated points from an experiment into a Searcher (#21506)
- Add CometLoggerCallback (#20766)
💫 Enhancements
- Only sync the checkpoint folder instead of the entire trial folder for cloud checkpoint. (#21658)
- Add test for heterogeneous resource request deadlocks (#21397)
- Remove unused return_or_clean_cached_pg (#21403)
- Remove TrialExecutor.resume_trial (#21225)
- Leave only one canonical way of stopping a trial (#21021)
🔨 Fixes
- Replace deprecated running_sanity_check with sanity_checking in PTL integration (#21831)
- Fix loading an ExperimentAnalysis object without a registered Trainable (#21475)
- Fix stale node detection bug (#21516)
- Fixes to allow tune/tests/test_commands.py to run on Windows (#21342)
- Deflake PBT tests (#21366)
- Fix dtype coercion in tune.choice (#21270)
📖 Documentation
- Fix typo in schedulers.rst (#21777)
Train
🎉 New Features
💫 Enhancements
🔨 Fixes
- Fix Dataloader (#21467)
📖 Documentation
Serve
🎉 New Features
🔨 Fixes
- Warn when serve.start() with different options (#21562)
- Detect http.disconnect and cancel requests properly (#21438)
Thanks
Many thanks to all those who contributed to this release!
@isaac-vidas, @wuisawesome, @stephanie-wang, @jon-chuang, @xwjiang2010, @jjyao, @MissiontoMars, @qbphilip, @yaoyuan97, @gjoliver, @Yard1, @rkooo567, @talesa, @czgdp1807, @DN6, @sven1977, @kfstorm, @krfricke, @simon-mo, @hauntsaninja, @pcmoritz, @JamieSlome, @chaokunyang, @jovany-wang, @sidward14, @DmitriGekhtman, @ericl, @mwtian, @jwyyy, @clarkzinzow, @hckuo, @vakker, @HuangLED, @iycheng, @edoakes, @shrekris-anyscale, @robertnishihara, @avnishn, @mickelliu, @ndrwnaguib, @ijrsvt, @Zyiqin-Miranda, @bveeramani, @SongGuyang, @n30111, @WangTaoTheTonic, @suquark, @richardliaw, @qicosmos, @scv119, @architkulkarni, @lixin-wei, @Catch-Bull, @acxz, @benblack769, @clay4444, @amogkam, @marin-ma, @maxpumperla, @jiaodong, @mattip, @isra17, @raulchen, @wilsonwang371, @carlogrisetti, @ashione, @matthewdeng
Ray-1.10.0
Highlights
- 🎉 Ray Windows support is now in beta: a significant fraction of the Ray test suite is now passing on Windows. We are eager to learn about your experience with Ray 1.10 on Windows; please file any issues you encounter at https://github.com/ray-project/ray/issues. In upcoming releases we will spend more time on making Ray Serve and Runtime Environment tests pass on Windows and on general polish.
Ray Autoscaler
💫Enhancements:
- Add autoscaler update time to prometheus metrics (#20831)
- Fewer non terminated nodes calls in autoscaler update (#20359, #20623)
🔨 Fixes:
- GCP TPU autoscaling fix (#20311)
- Scale-down stability fix (#21204)
- Report node launch failure in driver logs (#20814)
Ray Client
💫Enhancements
- Client task options are encoded with pickle instead of json (#20930)
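One reason to prefer pickle over JSON for encoding options is that JSON cannot preserve every Python type on a round trip. The sketch below uses only the standard library and is not Ray Client's wire code; the option names in the dict are made up for illustration.

```python
import json
import pickle

# Hypothetical task options; the tuple is the interesting part.
options = {"num_cpus": 1, "resources": {"custom": 0.5}, "placement": ("node-a", 2)}


def roundtrip_json(obj):
    """Encode and decode with JSON (tuples come back as lists)."""
    return json.loads(json.dumps(obj))


def roundtrip_pickle(obj):
    """Encode and decode with pickle (Python types are preserved)."""
    return pickle.loads(pickle.dumps(obj))


if __name__ == "__main__":
    print(roundtrip_json(options)["placement"])    # tuple became a list
    print(roundtrip_pickle(options)["placement"])  # tuple is preserved
```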
Ray Core
🎉 New Features:
- runtime_env's pip field now installs pip packages in your existing environment instead of installing them in a new isolated environment. (#20341)
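As a minimal sketch of the pip field described above, the snippet below builds a runtime_env dict and passes it to ray.init. The helper names pip_runtime_env and run_with_packages, and the choice of the requests package, are illustrative assumptions rather than part of the release. Ray is imported lazily so that the dict helper can be used without a cluster.

```python
from typing import Iterable


def pip_runtime_env(packages: Iterable[str]) -> dict:
    """Build a runtime_env dict whose pip field lists the given requirements."""
    return {"pip": list(packages)}


def run_with_packages(packages: Iterable[str]) -> str:
    """Start Ray with the pip runtime_env and import a package inside a task."""
    import ray  # imported here so the helper above works without Ray installed

    ray.init(runtime_env=pip_runtime_env(packages))

    @ray.remote
    def requests_version() -> str:
        import requests  # assumed to be one of the requested packages

        return requests.__version__

    try:
        return ray.get(requests_version.remote())
    finally:
        ray.shutdown()
```

Calling run_with_packages(["requests"]) on a machine with Ray installed should return the version of requests visible to the worker.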
🔨 Fixes:
- Fix bug where specifying a per-job runtime_env conda/pip from a local requirements file did not work when using Ray Client on a remote cluster (#20855)
- Security fixes for log4j2: the log4j2 version has been bumped to 2.17.1 (#21373)
💫Enhancements:
- Allow runtime_env working_dir and py_modules to be pathlib.Path type (#20853, #20810)
- Add environment variable to skip local runtime_env garbage collection (#21163)
- Change runtime_env error log to debug log (#20875)
- Improved reference counting for runtime_env resources (#20789)
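With the pathlib.Path enhancement listed above, a runtime_env can be assembled from Path objects without converting them to strings first. The following is a sketch only; path_runtime_env and the my_module directory are hypothetical names.

```python
from pathlib import Path


def path_runtime_env(project_root: Path) -> dict:
    """Build a runtime_env whose working_dir and py_modules are Path objects."""
    return {
        "working_dir": project_root,
        "py_modules": [project_root / "my_module"],  # hypothetical local package
    }
```

The resulting dict would be passed as ray.init(runtime_env=path_runtime_env(Path.cwd())).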
🏗 Architecture refactoring:
- Refactor runtime_env to use protobuf for multi-language support (#19511)
📖Documentation:
Ray Data Processing
🎉 New Features:
- Added stats framework for debugging Datasets performance (#20867, #21070)
- [Dask-on-Ray] New config helper for enabling the Dask-on-Ray scheduler (#21114)
💫Enhancements:
- Reduce memory usage when converting to a Pandas DataFrame (#20921)
🔨 Fixes:
- Fix slow block evaluation when splitting (#20693)
- Fix boundary sampling concatenation on non-uniform blocks (#20784)
- Fix boolean tensor column slicing (#20905)
🏗 Architecture refactoring:
- Refactor table block structure to support more tabular block formats (#20721)
RLlib
🎉 New Features:
- Support for RE3 exploration algorithm (for tf only). (#19551)
- Environment pre-checks, better failure behavior and enhanced environment API. (#20481, #20832, #20868, #20785, #21027, #20811)
🏗 Architecture refactoring:
- Evaluation: Support evaluation setting that makes sure train doesn't ever have to wait for eval to finish (b/c of long episodes). (#20757); Always attach latest eval metrics. (#21011)
- Soft-deprecate build_trainer() utility function in favor of sub-classing Trainer directly (and overriding some of its methods). (#20635, #20636, #20633, #20424, #20570, #20571, #20639, #20725)
- Experimental no-flatten option for actions/prev-actions. (#20918)
- Use SampleBatch instead of an input dict whenever possible. (#20746)
- Switch off Preprocessors by default for PGTrainer (experimental). (#21008)
- Toward a Replay Buffer API (cleanups; docstrings; renames; move into rllib/execution/buffers dir) (#20552)
📖Documentation:
- Overhaul of auto-API reference pages. (#19786, #20537, #20538, #20486, #20250)
- README and RLlib landing page overhaul (#20249).
- Added example containing code to compute an adapted (time-dependent) GAE used by the PPO algorithm (#20850).
🔨 Fixes:
Tune
🎉 New Features:
- Introduce TrialCheckpoint class, making checkpoint down/upload easier (#20585)
- Add random state to BasicVariantGenerator (#20926)
- Multi-objective support for Optuna (#20489)
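The point of giving a variant generator a random state is reproducibility: the same seed yields the same sequence of sampled configurations. The standard-library sketch below illustrates that idea only and is not Tune's implementation; sample_configs and the search space are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def sample_configs(
    space: Dict[str, Sequence], num_samples: int, seed: int
) -> List[dict]:
    """Draw configurations from a categorical space with a private seeded RNG."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [
        {name: rng.choice(choices) for name, choices in space.items()}
        for _ in range(num_samples)
    ]


if __name__ == "__main__":
    space = {"lr": [0.1, 0.01, 0.001], "batch_size": [16, 32, 64]}
    # Two runs with the same seed produce identical samples.
    print(sample_configs(space, 3, seed=42) == sample_configs(space, 3, seed=42))
```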
💫Enhancements:
- Add set_max_concurrency to Searcher API (#20576)
- Allow for tuples in _split_resolved_unresolved_values. (#20794)
- Show the name of training func, instead of just ImplicitFunction. (#21029)
- Enforce one future at a time for any given trial at any given time. (#20783)
- Move on_no_available_trials to a subclass under runner (#20809)
- Clean up code (#20555, #20464, #20403, #20653, #20796, #20916, #21067)
- Start restricting TrialRunner/Executor interface exposures. (#20656)
- TrialExecutor should not take in Runner interface. (#20655)
🔨Fixes:
- Deflake test_tune_restore.py (#20776)
- Fix best_trial_str for nested custom parameter columns (#21078)
- Fix checkpointing error message on K8s (#20559)
- Fix testResourceScheduler and testMultiStepRun. (#20872)
- Fix tune cloud tests for function and rllib trainables (#20536)
- Move _head_bundle_is_empty after conversion (#21039)
- Elongate test_trial_scheduler_pbt timeout. (#21120)
Train
🔨Fixes:
- Ray Train environment variables are automatically propagated and do not need to be manually set on every node (#20523)
- Various minor fixes and improvements (#20952, #20893, #20603, #20487)
📖Documentation:
- Update saving/loading checkpoint docs (#20973). Thanks @jwyyy!
- Various minor doc updates (#20877, #20683)
Serve
💫Enhancements:
- Add validation to Serve AutoscalingConfig class (#20779)
- Add Serve metric for HTTP error codes (#21009)
🔨Fixes:
- No longer create placement group for deployment with no resources (#20471)
- Log errors in deployment initialization/configuration user code (#20620)
Jobs
🎉 New Features:
- Logs can be streamed from job submission server with ray job logs command (#20976)
- Add documentation for ray job submission (#20530)
- Propagate custom headers field to JobSubmissionClient and apply to all requests (#20663)
🔨Fixes:
- Fix bug where job serve accidentally creates local ray processes instead of connecting (#20705)
💫Enhancements:
- [Jobs] Update CLI examples to use the same setup (#20844)
Thanks
Many thanks to all those who contributed to this release!
@dmatrix, @suquark, @tekumara, @jiaodong, @jovany-wang, @avnishn, @simon-mo, @iycheng, @SongGuyang, @ArturNiederfahrenhorst, @wuisawesome, @kfstorm, @matthewdeng, @jjyao, @chenk008, @Sertingolix, @larrylian, @czgdp1807, @scv119, @duburcqa, @runedog48, @Yard1, @robertnishihara, @geraint0923, @amogkam, @DmitriGekhtman, @ijrsvt, @kk-55, @lixin-wei, @mvindiola1, @hauntsaninja, @sven1977, @Hankpipi, @qbphilip, @hckuo, @newmanwang, @clay4444, @edoakes, @liuyang-my, @iasoon, @WangTaoTheTonic, @fgogolli, @dproctor, @gramhagen, @krfricke, @richardliaw, @bveeramani, @pcmoritz, @ericl, @simonsays1980, @carlogrisetti, @stephanie-wang, @AmeerHajAli, @mwtian, @xwjiang2010, @shrekris-anyscale, @n30111, @lchu-ibm, @Scalsol, @seonggwonyoon, @gjoliver, @qicosmos, @xychu, @iamhatesz, @architkulkarni, @jwyyy, @rkooo567, @mattip, @ckw017, @MissiontoMars, @clarkzinzow
Ray-1.9.2
Patch release to bump the log4j version from 2.16.0 to 2.17.0. This resolves the security issue CVE-2021-45105.
Ray-1.9.1
Patch release to bump the log4j2 version from 2.14 to 2.16. This resolves the security vulnerabilities https://nvd.nist.gov/vuln/detail/CVE-2021-44228 and https://nvd.nist.gov/vuln/detail/CVE-2021-45046.
No library or core changes included.
Thanks @seonggwonyoon and @ijrsvt for contributing the fixes!