22 Oct 04:49

matthewdeng

03b6bc7

Ray-2.0.1

The Ray 2.0.1 patch release contains dependency upgrades and fixes for multiple components:

Upgrade grpcio version to 1.32 (#28025)
Upgrade redis version to 7.0.5 (#28936)
Fix segfault when using runtime environments (#28409)
Increase RPC timeout for dashboard (#28330)
Set correct path when using python -m (#28140)
[Autoscaler] Fix autoscaling for 0 CPU head node (#26813)
[Serve] Allow code in private remote Git URIs to be imported (#28250)
[Serve] Allow host and port in Serve config (#27026)
[RLlib] Evaluation supports asynchronous rollout (single slow eval worker will not block the overall evaluation progress). (#27390)
[Tune] Fix hang during checkpoint synchronization (#28155)
[Tune] Fix trial restoration from different IP (#28470)
[Tune] Fix custom synchronizer serialization (#28699)
[Workflows] Replace deprecated name option with task_id (#28151)

23 Aug 04:57

scv119

ray-2.0.0

cba26cc

Ray-2.0.0

Release Highlights

Ray 2.0 is an exciting release with enhancements to all libraries in the Ray ecosystem. With this major release, we take strides towards our goal of making distributed computing scalable, unified, and open.

Towards these goals, Ray 2.0 features new capabilities for unifying the machine learning (ML) ecosystem, improving Ray's production support, and making it easier than ever for ML practitioners to use Ray's libraries.

Highlights:

Ray AIR, a scalable and unified toolkit for ML applications, is now in Beta.
Ray now supports natively shuffling 100TB or more of data with the Ray Datasets library.
KubeRay, a toolkit for running Ray on Kubernetes, is now in Beta. This replaces the legacy Python-based Ray operator.
Ray Serve’s Deployment Graph API is a new and easier way to build, test, and deploy an inference graph of deployments. This is released as Beta in 2.0.

A migration guide for all the different libraries can be found here: Ray 2.0 Migration Guide.

Ray Libraries

Ray AIR

Ray AIR is now in beta. Ray AIR builds upon Ray’s libraries to enable end-to-end machine learning workflows and applications on Ray. You can install all dependencies needed for Ray AIR via pip install -U "ray[air]".

🎉 New Features:

Predictors:
- BatchPredictors now have support for scalable inference on GPUs.
- All Predictors can now be constructed from pre-trained models, allowing you to easily scale batch inference with trained models from common ML frameworks.
- ray.ml.predictors has been moved to the Ray Train namespace (ray.train).
Preprocessing: New preprocessors and API changes on Ray Datasets now make feature processing easier to do on AIR. See the Ray Data release notes for more details.
New features for Datasets/Train/Tune/Serve can be found in the corresponding library release notes.

💫 Enhancements:

Major package refactoring is included in this release.
- ray.ml is renamed to ray.air.
- ray.ml.preprocessors have been moved to ray.data.
  - train_test_split is now a new method of ray.data.Dataset (#27065)
- ray.ml.trainers have been moved to ray.train (#25570)
- ray.ml.predictors has been moved to ray.train.
- ray.ml.config has been moved to ray.air.config (#25712).
- Checkpoints are now framework-specific: each Trainer generates its own framework-specific Checkpoint class. See Ray Train for more details.
- ModelWrappers have been renamed to PredictorDeployments.
API stability annotations have been added (#25485)
Train/Tune now have the same reporting and checkpointing API -- see the Train notes for more details (#26303)
ScalingConfigs are now dataclasses, not dict types
Many AIR examples, benchmarks, and documentation pages were added in this release. The Ray AIR documentation will cover breadth of usage (end to end workflows across different libraries) while library-specific documentation will cover depth (specific features of a specific library).

🔨 Fixes:

Many documentation examples were previously untested. This release fixes those examples and adds them to the CI.
Predictors:
- Torch/Tensorflow Predictors have correctness fixes (#25199, #25190, #25138, #25136)
- Update KerasCallback to work with TensorflowPredictor (#26089)
- Add streaming BatchPredictor support (#25693)
- Add predict_pandas implementation (#25534)
- Add _predict_arrow interface for Predictor (#25579)
- Allow creating Predictor directly from a UDF (#26603)
- Execute GPU inference in a separate stage in BatchPredictor (#26616, #27232, #27398)
- Accessors for preprocessor in Predictor class (#26600)
- [AIR] Predictor call_model API for unsupported output types (#26845)

Ray Data Processing

🎉 New Features:

Add ImageFolderDatasource (#24641)
Add the NumPy batch format for batch mapping and batch consumption (#24870)
Add iter_torch_batches() and iter_tf_batches() APIs (#26689)
Add local shuffling API to iterators (#26094)
Add drop_columns() API (#26200)
Add randomize_block_order() API (#25568)
Add random_sample() API (#24492)
Add support for len(Dataset) (#25152)
Add UDF passthrough args to map_batches() (#25613)
Add Concatenator preprocessor (#26526)
Change range_arrow() API to range_table() (#24704)
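
The local shuffling entry above refers to shuffling rows within a bounded buffer while iterating, rather than shuffling the whole dataset. The following is a minimal pure-Python sketch of that general technique, not Ray's implementation; the function name local_shuffle and its parameters are hypothetical and chosen only for illustration.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def local_shuffle(rows: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield rows in a locally shuffled order using a bounded buffer.

    Illustrative only: `local_shuffle` is a hypothetical helper, not a Ray API.
    """
    rng = random.Random(seed)
    buffer: List[T] = []
    for row in rows:
        buffer.append(row)
        if len(buffer) >= buffer_size:
            # Move a randomly chosen row to the end of the buffer and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Input exhausted: shuffle and drain whatever is still buffered.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(local_shuffle(range(10), buffer_size=4))
unshuffled = list(local_shuffle(range(5), buffer_size=1))
```

A larger buffer gives a more thorough shuffle at the cost of memory; a buffer of size 1 leaves the order unchanged.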

💫 Enhancements:

Autodetect dataset parallelism based on available resources and data size (#25883)
Use polars for sorting (#25454)
Support tensor columns in to_tf() and to_torch() (#24752)
Add explicit resource allocation option via a top-level scheduling strategy (#24438)
Spread actor pool actors evenly across the cluster by default (#25705)
Add ray_remote_args to read_text() (#23764)
Add max_epoch argument to iter_epochs() (#25263)
Add Pandas-native groupby and sorting (#26313)
Support push-based shuffle in groupby operations (#25910)
More aggressive memory releasing for Dataset and DatasetPipeline (#25461, #25820, #26902, #26650)
Automatically cast tensor columns on Pandas UDF outputs (#26924)
Better error messages when reading from S3 (#26619, #26669, #26789)
Make dataset splitting more efficient and stable (#26641, #26768, #26778)
Use sampling to estimate in-memory data size for Parquet data source (#26868)
De-experimentalized lazy execution mode (#26934)

🔨 Fixes:

Fix pipeline pre-repeat caching (#25265)
Fix stats construction for from_*() APIs (#25601)
Fix label tensor squeezing in to_tf() (#25553)
Fix stage fusion between equivalent resource args (fixes BatchPredictor) (#25706)
Fix tensor extension string formatting (repr) (#25768)
Workaround for unserializable Arrow JSON ReadOptions (#25821)
Make ActorPoolStrategy kill pool of actors if exception is raised (#25803)
Fix max number of actors for default actor pool strategy (#26266)
Fix byte size calculation for non-trivial tensors (#25264)

Ray Train

Ray Train has received a major expansion of scope with Ray 2.0.

In particular, the Ray Train module now contains:

Trainers
Predictors
Checkpoints

for common ML frameworks including PyTorch, TensorFlow, XGBoost, LightGBM, HuggingFace, and Scikit-Learn. These APIs help provide end-to-end usage of Ray libraries in Ray AIR workflows.

🎉 New Features:

The Trainer API is now deprecated in favor of the new Ray AIR Trainers API. Trainers for PyTorch, TensorFlow, Horovod, XGBoost, and LightGBM are now in Beta. (#25570)
ML framework-specific Predictors have been moved into the ray.train namespace. This provides a streamlined API for offline and online inference of PyTorch, TensorFlow, XGBoost models and more. (#25769, #26215, #26251, #26451, #26531, #26600, #26603, #26616, #26845)
ML framework-specific checkpoints are introduced. Checkpoints are consumed by Predictors to load model weights and information. (#26777, #25940, #26532, #26534)

💫 Enhancements:

Train and Tune now use the same reporting and checkpointing API (#24772, #25558)
Add tunable ScalingConfig dataclass (#25712)
Randomize block order by default to avoid hotspots (#25870)
Improve checkpoint configurability and extend results (#25943)
Improve prepare_data_loader to support multiple batch data types (#26386)
Discard returns of train loops in Trainers (#26448)
Clean up logs, reprs, warnings (#26259, #26906, #26988, #27228, #27519)

📖 Documentation:

Update documentation to use new Train API (#25735)
Update documentation to use session API (#26051, #26303)
Add Trainer user guide and update Trainer docs (#27570, #27644, #27685)
Add Predictor documentation (#25833)
Replace to_torch with iter_torch_batches (#27656)
Replace to_tf with iter_tf_batches (#27768)
Minor doc fixes (#25773, #27955)

🏗 Architecture refactoring:

Clean up ray.train package (#25566)
Mark Trainer interfaces as Deprecated (#25573)

🔨 Fixes:

An issue with GPU ID detection and assignment was fixed. (#26493)
Fix AMP for models with a custom __getstate__ method (#25335)
Fix transformers example for multi-gpu (#24832)
Fix ScalingConfig key validation (#25549)
Fix ResourceChangingScheduler integration (#26307)
Fix auto_transfer cuda device (#26819)
Fix BatchPredictor.predict_pipelined not working with GPU stage (#27398)
Remove rllib dependency from tensorflow_predictor (#27688)

Ray Tune

🎉 New Features:

The Tuner API is the new way of running Ray Tune experiments. (#26987, #26961, #26931, #26884, #26930)
Ray Tune and Ray Train now have the same API for reporting (#25558)
Introduce tune.with_resources() to specify function trainable resources (#26830)
Add Tune benchmark for AIR (#26763, #26564)
Allow Tuner().restore() from cloud URIs (#26963)
Add top-level imports for Tuner, TuneConfig, move CheckpointConfig (#26882)
Add resume experiment options to Tuner.restore() (#26826)
Add checkpoint_frequency/checkpoint_at_end arguments to CheckpointConfig (#26661)
Add more config arguments to Tuner (#26656)
Better error message for Tune nested tasks / actors (#25241)
Allow iterators in tune.grid_search (#25220)
Add get_dataframe() method to result grid, fix config flattening (#24686)

💫 Enhancements:

Expose number of errored/terminated trials in ResultGrid (#26655)
remove f...

Contributors

ghost, ericl, and 130 other contributors

09 Jun 17:15

avnishn

ray-1.13.0

e4ce38d

Ray-1.13.0

Highlights:

Python 3.10 support is now in alpha.
Ray usage stats collection is now on by default (guarded by an opt-out prompt).
Ray Tune can now synchronize Trial data from worker nodes via the object store (without rsync!)
Ray Workflow comes with a new API and is integrated with Ray DAG.

Ray Autoscaler

💫Enhancements:

CI tests for KubeRay autoscaler integration (#23365, #23383, #24195)
Stability enhancements for KubeRay autoscaler integration (#23428)

🔨 Fixes:

Improved GPU support in KubeRay autoscaler integration (#23383)
Resources scheduled with the node affinity strategy are not reported to the autoscaler (#24250)

Ray Client

💫Enhancements:

Add option to configure ray.get with >2 sec timeout (#22165)
Return None from internal KV for non-existent keys (#24058)

🔨 Fixes:

Fix deadlock by switching to SimpleQueue on Python 3.7 and newer in async dataclient (#23995)

Ray Core

🎉 New Features:

Ray usage stats collection is now on by default (guarded by an opt-out prompt)
Alpha support for Python 3.10 (on Linux and Mac)
Node affinity scheduling strategy (#23381)
Add metrics for disk and network I/O (#23546)
Improve exponential backoff when connecting to Redis (#24150)
Add the ability to inject a setup hook for customization of runtime_env on init (#24036)
Add a utility to check GCS / Ray cluster health (#23382)
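
The exponential backoff mentioned in the list above means that the wait between connection attempts grows geometrically up to a cap. A minimal sketch of that idea follows; it is not Ray's actual retry code, and the helper name backoff_delays and all parameter values are hypothetical.

```python
from typing import List


def backoff_delays(base: float, factor: float, max_delay: float, attempts: int) -> List[float]:
    """Return the wait time before each retry, growing geometrically up to a cap.

    Illustrative only: `backoff_delays` is a hypothetical helper, not a Ray API.
    """
    delays = []
    delay = base
    for _ in range(attempts):
        # Never wait longer than max_delay, however many attempts have failed.
        delays.append(min(delay, max_delay))
        delay *= factor
    return delays


delays = backoff_delays(base=0.5, factor=2.0, max_delay=5.0, attempts=6)
```

In practice a small random jitter is often added to each delay so that many clients do not retry at the same instant; it is left out here to keep the sketch deterministic.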

🔨 Fixes:

Fixed internal storage S3 bugs (#24167)
Ensure "get_if_exists" takes effect in the decorator. (#24287)
Reduce memory usage for Pubsub channels that do not require total memory cap (#23985)
Add memory buffer limit in publisher for each subscribed entity (#23707)
Use gRPC instead of socket for GCS client health check (#23939)
Trim size of Reference struct (#23853)
Enable debugging into pickle backend (#23854)

🏗 Architecture refactoring:

GCS storage interfaces unification (#24211)
Cleanup pickle5 version check (#23885)
Simplify options handling (#23882)
Moved function and actor importer away from pubsub (#24132)
Replace the legacy ResourceSet & SchedulingResources at Raylet (#23173)
Unification of AddSpilledUrl and UpdateObjectLocationBatch RPCs (#23872)
Save task spec in separate table (#22650)

Ray Datasets

🎉 New Features:

Performance improvement: the aggregation computation is vectorized (#23478)
Performance improvement: bulk parquet file reading is optimized with the fast metadata provider (#23179)
Performance improvement: more efficient move semantics for Datasets block processing (#24127)
Supports Datasets lineage serialization (aka out-of-band serialization) (#23821, #23931, #23932)
Supports native Tensor views in map processing for pure-tensor datasets (#24812)
Implemented push-based shuffle (#24281)

🔨 Fixes:

Documentation improvement: Getting Started page (#24860)
Documentation improvement: FAQ (#24932)
Documentation improvement: End to end examples (#24874)
Documentation improvement: Feature guide - Creating Datasets (#24831)
Documentation improvement: Feature guide - Saving Datasets (#24987)
Documentation improvement: Feature guide - Transforming Datasets (#25033)
Documentation improvement: Datasets APIs docstrings (#24949)
Performance: fixed block prefetching (#23952)
Fixed zip() for Pandas dataset (#23532)

🏗 Architecture refactoring:

Refactored LazyBlockList (#23624)
Added path-partitioning support for all content types (#23624)
Added fast metadata provider and refactored Parquet datasource (#24094)

RLlib

🎉 New Features:

Replay buffer API: First algorithms are using the new replay buffer API, allowing users to define and configure their own custom buffers or use RLlib’s built-in ones: SimpleQ, DQN (#24164, #22842, #23523, #23586)

🏗 Architecture refactoring:

More algorithms moved into the training iteration function API (no longer using execution plans). Users can now more easily read, develop, and debug RLlib’s algorithms: A2C, APEX-DQN, CQL, DD-PPO, DQN, MARWIL + BC, PPO, QMIX, SAC, SimpleQ, SlateQ, Trainers defined in examples folder. (#22937, #23420, #23673, #24164, #24151, #23735, #24157, #23798, #23906, #24118, #22842, #24166, #23712). This will be fully completed and documented with Ray 2.0.
Make RolloutWorkers (optionally) recoverable after failure via the new recreate_failed_workers=True config flag. (#23739)
POC for new TrainerConfig objects (instead of python config dicts): PPOConfig (for PPOTrainer) and PGConfig (for PGTrainer). (#24295, #23491)
Hard-deprecate build_trainer() (trainer_templates.py): All custom Trainers should now sub-class from any existing Trainer class. (#23488)

💫Enhancements:

Add support for complex observations in CQL. (#23332)
Bandit support for tf2. (#22838)
Make actions sent by RLlib to the env immutable. (#24262)
Memory leak finding toolset using tracemalloc + CI memory leak tests. (#15412)
Enable DD-PPO to run on Windows. (#23673)

🔨 Fixes:

APPO eager fix (APPOTFPolicy gets wrapped as_eager() twice by mistake). (#24268)
CQL gets stuck when deprecated timesteps_per_iteration is used (use min_train_timesteps_per_reporting instead). (#24345)
SlateQ runs on GPU (torch). (#23464)
Other bug fixes: #24016, #22050, #23814, #24025, #23740, #23741, #24006, #24005, #24273, #22010, #24271, #23690, #24343, #23419, #23830, #24335, #24148, #21735, #24214, #23818, #24429

Ray Workflow

🎉 New Features:

Workflow step is deprecated (#23796, #23728, #23456, #24210)

🔨 Fixes:

Fix a bug where max_retries was not aligned with Ray Core’s max_retries. (#22903)

🏗 Architecture refactoring:

Integrate ray storage in workflow (#24120)

Tune

🎉 New Features:

Add RemoteTask based sync client (#23605) (rsync not required anymore!)
Chunk file transfers in cross-node checkpoint syncing (#23804)
Also interrupt training when SIGUSR1 received (#24015)
reuse_actors per default for function trainables (#24040)
Enable AsyncHyperband to continue training for last trials after max_t (#24222)
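
Chunked file transfer, as named in the checkpoint syncing entry above, means sending a file as a sequence of fixed-size pieces instead of one large payload, so no single message has to hold the whole checkpoint. The sketch below shows the general pattern with the standard library only; it is not Tune's sync code, and iter_chunks and the chunk size are hypothetical.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int) -> Iterator[bytes]:
    """Yield successive chunks of at most chunk_size bytes from a binary stream.

    Illustrative only: `iter_chunks` is a hypothetical helper, not a Ray API.
    """
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # An empty read signals the end of the stream.
            break
        yield chunk


# Stand-in for a checkpoint file: 2500 bytes held in memory.
payload = b"x" * 2500
chunks = list(iter_chunks(io.BytesIO(payload), chunk_size=1024))
reassembled = b"".join(chunks)
```

The receiver simply appends chunks in order, which keeps peak memory proportional to the chunk size rather than the file size.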

💫Enhancements:

Improve testing (#23229)
Improve docstrings (#23375)
Improve documentation (#23477, #23924)
Simplify trial executor logic (#23396)
Make MLflowLoggerUtil copyable (#23333)
Use new Checkpoint interface internally (#22801)
Beautify Optional typehints (#23692)
Improve missing search dependency info (#23691)
Skip tmp checkpoints in analysis and read iteration from metadata (#23859)
Treat checkpoints with nan value as worst (#23862)
Clean up base ProgressReporter API (#24010)
De-clutter log outputs in trial runner (#24257)
HyperOpt searcher now supports tune.choice([[1,2],[3,4]]). (#24181)
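
Treating checkpoints with a NaN metric as worst, as listed above, matters because NaN compares as neither greater nor smaller than any number, so a plain sort can place it anywhere. The following sketch shows one way to rank checkpoints with NaN values forced to the end; it is not Tune's implementation, and rank_checkpoints and the sample data are hypothetical.

```python
import math
from typing import List, Tuple


def rank_checkpoints(
    checkpoints: List[Tuple[str, float]], mode: str = "max"
) -> List[Tuple[str, float]]:
    """Sort (name, metric) pairs from best to worst, placing NaN metrics last.

    Illustrative only: `rank_checkpoints` is a hypothetical helper, not a Ray API.
    """

    def sort_key(item: Tuple[str, float]) -> Tuple[int, float]:
        _, metric = item
        if math.isnan(metric):
            # The leading 1 sorts NaN entries after every real-valued entry.
            return (1, 0.0)
        # For "max" mode a larger metric is better, so negate it for ascending sort.
        return (0, -metric if mode == "max" else metric)

    return sorted(checkpoints, key=sort_key)


ranked = rank_checkpoints(
    [("ckpt_1", 0.71), ("ckpt_2", float("nan")), ("ckpt_3", 0.93)]
)
ranked_names = [name for name, _ in ranked]
```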

🔨Fixes:

Optuna should ignore additional results after trial termination (#23495)
Fix PTL multi GPU link (#23589)
Improve Tune cloud release tests for durable storage (#23277)
Fix tensorflow distributed trainable docstring (#23590)
Simplify experiment tag formatting, clean directory names (#23672)
Don't include nan metrics for best checkpoint (#23820)
Fix syncing between nodes in placement groups (#23864)
Fix memory resources for head bundle (#23861)
Fix empty CSV headers on trial restart (#23860)
Fix checkpoint sorting with nan values (#23909)
Make Timeout stopper work after restoring in the future (#24217)
Small fixes to tune-distributed for new restore modes (#24220)

Train

Most distributed training enhancements will be captured in the new Ray AIR category!

🔨Fixes:

Copy resources_per_worker to avoid modifying user input
Fix train.torch.get_device() for fractional GPU or multiple GPU per worker case (#23763)
Fix multi node horovod bug (#22564)
Fully deprecate Ray SGD v1 (#24038)
Improvements to fault tolerance (#22511)
MLflow start run under correct experiment (#23662)
Raise helpful error when required backend isn't installed (#23583)
Warn pending deprecation for ray.train.Trainer and ray.tune DistributedTrainableCreators (#24056)

📖Documentation:

Add FAQ (#22757)

Ray AIR

🎉 New Features:

HuggingFaceTrainer & HuggingFacePredictor (#23615, #23876)
SklearnTrainer & SklearnPredictor (#23803, #23850)
HorovodTrainer (#23437)
RLTrainer & RLPredictor (#23465, #24172)
BatchMapper preprocessor (#23700)
Categorizer preprocessor (#24180)
BatchPredictor (#23808)

💫Enhancements:

Add Checkpoint.as_directory() for efficient checkpoint fs processing (#23908)
Add config to Result, extend ResultGrid.get_best_config (#23698)
Add Scaling Config validation (#23889)
Add tuner test. (#23364)
Move storage handling to pyarrow.fs.FileSystem (#23370)
Refactor _get_unique_value_indices (#24144)
Refactor most_frequent SimpleImputer (#23706)
Set name of Trainable to match with Trainer (#23697)
Use checkpoint.as_directory() instead of cleaning up manually (#24113)
Improve file packing/unpacking (#23621)
Make Dataset ingest configurable (#24066)
Remove postprocess_checkpoint (#24297)

🔨Fixes:

Better exception handling (#23695)
Do not deepcopy RunConfig (#23499)
Reduce unnecessary stack traces (#23475)
Tuner should use run_config from Trainer per default (#24079)
Use custom fsspec handler for GS (#24008)

📖Documentation:

Add distributed torch_geometric example (#23580)
GNN example cleanup (#24080)

Serve

🎉 New Features:

Serve logging system was revamped! Access log is now turned on by default. (#23558)
New Gradio notebook example for Ray Serve deployments (#23494)
Serve now includes full traceback in deployment update error message (#23752)

💫Enhancements:

Serve Deployment Graph was...

Contributors

ericl, pcmoritz, and 82 other contributors

16 May 22:46

amogkam

ray-1.12.1

4863e33

Ray-1.12.1

Patch release with the following fixes:

Ray now works on Google Colab again! The bug with memory limit fetching when running Ray in a container is now fixed (#23922).
ray-ml Docker images for CPU will start being built again after they were stopped in Ray 1.9 (#24266).
[Train/Tune] Start MLflow run under the correct experiment for Ray Train and Ray Tune integrations (#23662).
[RLlib] Fix for APPO in eager mode (#24268).
[RLlib] Fix Alphastar for TF2 and tracing enabled (c5502b2).
[Serve] Fix replica leak in anonymous namespaces (#24311).

10 May 20:48

architkulkarni

ray-1.11.1

5c9b100

Ray-1.11.1

Patch release including fixes for the following issues:

Ray Job Submission not working with remote working_dir URLs in their runtime environment (#22018)
Ray Tune + MLflow integration failing to set MLflow experiment ID (#23662)
Dependencies for gym not pinned, leading to version incompatibility issues (#23705)

08 Apr 03:05

jianoaix

ray-1.12.0

f18fc31

Ray-1.12.0

Highlights

Ray AI Runtime (AIR), an open-source toolkit for building end-to-end ML applications on Ray, is now in Alpha. AIR is an effort to unify the experience of using different Ray libraries (Ray Data, Train, Tune, Serve, RLlib). You can find more information in the docs or in the public RFC.
- Getting involved with Ray AIR. We’ll be holding office hours, development sprints, and other activities as we get closer to the Ray AIR Beta/GA release. Want to join us? Fill out this short form!
Ray usage data collection is now off by default. If you have any questions or concerns, please comment on the RFC.
New algorithms are added to RLlib: SlateQ & Bandits (for recommender systems use cases) and AlphaStar (multi-agent, multi-GPU w/ league-based self-play)
Ray Datasets: new lazy execution model with automatic task fusion and memory-optimizing move semantics; first-class support for Pandas DataFrame blocks; efficient random access datasets.

Ray Autoscaler

🎉 New Features

Support cache_stopped_nodes on Azure (#21747)
AWS Cloudwatch support (#21523)

💫 Enhancements

Improved documentation and standards around built-in autoscaler node providers. (#22236, #22237)
Improved KubeRay support (#22987, #22847, #22348, #22188)
Remove redis requirement (#22083)

🔨 Fixes

No longer print infeasible warnings for internal placement group resources. Placement groups which cannot be satisfied by the autoscaler still trigger warnings. (#22235)
Default AMIs per AWS region are updated/fixed. (#22506)
GCP node termination updated (#23101)
Retry legacy k8s operator on monitor failure (#22792)
Cap min and max workers for manually managed on-prem clusters (#21710)
Fix initialization artifacts (#22570)
Ensure initial scaleup with high upscaling_speed isn't limited. (#21953)

Ray Client

🎉 New Features:

ray.init has consistent return value in client mode and driver mode (#21355)

💫Enhancements:

Gets and puts are streamed to support arbitrary object sizes (#22100, #22327)

🔨 Fixes:

Fix Ray Client object ref releasing in wrong context (#22025)

Ray Core

🎉 New Features

RuntimeEnv:
- Support setting timeout for runtime_env setup. (#23082)
- Support setting pip_check and pip_version for runtime_env. (#22826, #23306)
- env_vars will take effect when the pip install command is executed. (temporarily ineffective in conda) (#22730)
- Support strongly-typed API ray.runtime.RuntimeEnv to define runtime env. (#22522)
- Introduce virtualenv to isolate the pip type runtime env. (#21801, #22309)
Raylet shares fate with the dashboard agent, and the dashboard agent stays alive when it catches port conflicts. (#22382, #23024)
Enable dashboard in the minimal ray installation (#21896)
Add task and object reconstruction status to ray memory CLI tools (#22317)
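
The RuntimeEnv options listed above can be combined in a single runtime environment dictionary. The fragment below is a sketch under assumptions: the key names (pip_check, pip_version, env_vars, setup_timeout_seconds) reflect the documented runtime_env schema as best understood here and should be checked against the docs for your Ray version, and the package list, index URL, and timeout value are hypothetical.

```python
# Sketch of a runtime_env dictionary. Package names, the index URL, and the
# timeout are hypothetical values chosen only for illustration.
runtime_env = {
    "pip": {
        "packages": ["requests", "pendulum==2.1.2"],
        "pip_check": False,         # skip the `pip check` step after installation
        "pip_version": "==22.0.2",  # version specifier for pip itself
    },
    # Environment variables; per the notes above these also apply while
    # the pip install command runs.
    "env_vars": {"PIP_INDEX_URL": "https://pypi.org/simple"},
    # Assumed location of the setup timeout option.
    "config": {"setup_timeout_seconds": 600},
}

# With Ray installed, a dictionary like this is typically passed as
# ray.init(runtime_env=runtime_env).
```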

🔨 Fixes

Report only memory usage of pinned object copies to improve scaledown. (#22020)
Scheduler:
- No spreading if a node is selected for lease request due to locality. (#22015)
- Placement group scheduling: Non-STRICT_PACK PGs should be sorted by resource priority, size (#22762)
- Round robin during spread scheduling (#21303)
Object store:
- Increment ref count when creating an ObjectRef to prevent object from going out of scope (#22120)
- Cleanup handling for nondeterministic object size during transfer (#22639)
- Fix bug in fusion for spilled objects (#22571)
- Handle IO worker failures correctly (#20752)
Improve ray stop behavior (#22159)
Avoid warning when receiving too many logs from a different job (#22102)
GCS resource manager bug fixes and cleanup. (#22462, #22459)
Release GIL when running parallel_memcopy() / memcpy() during serializations. (#22492)
Fix registering serializer before initializing Ray. (#23031)

🏗 Architecture refactoring

Ray distributed scheduler refactoring: (#21927, #21992, #22160, #22359, #22722, #22817, #22880, #22893, #22885, #22597, #22857, #23124)
Removed support for bootstrapping with Redis.

Ray Data Processing

🎉 New Features

Big Performance and Stability Improvements:
- Add lazy execution mode with automatic stage fusion and optimized memory reclamation via block move semantics (#22233, #22374, #22373, #22476)
- Support for random access datasets, providing efficient random access to rows via binary search (#22749)
- Add automatic round-robin load balancing for reading and shuffle reduce tasks, obviating the need for the _spread_resource_prefix hack (#21303)
More Efficient Tabular Data Wrangling:
- Add first-class support for Pandas blocks, removing expensive Arrow <-> Pandas conversion costs (#21894)
- Expose TableRow API + minimize copies/type-conversions on row-based ops (#22305)
Groupby + Aggregations Improvements:
- Support mapping over groupby groups (#22715)
- Support ignoring nulls in aggregations (#20787)
Improved Dataset Windowing:
- Support windowing a dataset by bytes instead of number of blocks (#22577)
- Batch across windows in DatasetPipelines (#22830)
Better Text I/O:
- Support streaming snappy compression for text files (#22486)
- Allow for custom decoding error handling in read_text() (#21967)
- Add option for dropping empty lines in read_text() (#22298)
New Operations:
- Add add_column() utility for adding derived columns (#21967)
Support for metadata provider callback for read APIs (#22896)
Support configuring autoscaling actor pool size (#22574)
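
Random access via binary search, as described in the list above, relies on the data being sorted by the lookup key: one search picks the block that could contain the key, and a second search finds the row inside that block. The sketch below illustrates that two-level lookup with the standard library; it is not Ray's implementation, and the class SortedBlockIndex and its methods are hypothetical.

```python
from bisect import bisect_left
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


class SortedBlockIndex:
    """Two-level binary search over rows sorted by a key and split into blocks.

    Illustrative only: this class is hypothetical, not a Ray API.
    """

    def __init__(self, rows: List[Row], key: str, block_size: int) -> None:
        ordered = sorted(rows, key=lambda r: r[key])
        self.key = key
        self.blocks = [
            ordered[i : i + block_size] for i in range(0, len(ordered), block_size)
        ]
        # Largest key in each block, used to choose which block to search.
        self.upper_bounds = [block[-1][key] for block in self.blocks]

    def get(self, value: Any) -> Optional[Row]:
        # First search: find the first block whose largest key is >= value.
        block_idx = bisect_left(self.upper_bounds, value)
        if block_idx == len(self.blocks):
            return None
        block = self.blocks[block_idx]
        # Second search: locate the row inside the chosen block.
        keys = [row[self.key] for row in block]
        row_idx = bisect_left(keys, value)
        if row_idx < len(block) and keys[row_idx] == value:
            return block[row_idx]
        return None


index = SortedBlockIndex(
    [{"id": i, "square": i * i} for i in range(10)], key="id", block_size=3
)
hit = index.get(7)
miss = index.get(42)
```

Each lookup costs a logarithmic number of comparisons in the number of blocks plus a search within one block, so only a single block has to be touched per query.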

🔨 Fixes

Force lazy datasource materialization in order to respect DatasetPipeline stage boundaries (#21970)
Simplify lifetime of designated block owner actor, and don’t create it if dynamic block splitting is disabled (#22007)
Respect 0 CPU resource request when using manual resource-based load balancing (#22017)
Remove batch format ambiguity by always converting Arrow batches to Pandas when batch_format="native" is given (#21566)
Fix leaked stats actor handle due to closure capture reference counting bug (#22156)
Fix boolean tensor column representation and slicing (#22323)
Fix unhandled empty block edge case in shuffle (#22367)
Fix unserializable Arrow Partitioning spec (#22477)
Fix incorrect iter_epochs() batch format (#22550)
Fix infinite iter_epochs() loop on unconsumed epochs (#22572)
Fix infinite hang on split() when num_shards < num_rows (#22559)
Patch Parquet file fragment serialization to prevent metadata fetching (#22665)
Don’t reuse task workers for actors or GPU tasks (#22482)
Pin pipeline executor actors to driver node to allow for lineage-based fault tolerance for pipelines (#22715)
Always use non-empty blocks to determine schema (#22834)
API fix bash (#22886)
Make label_column optional for to_tf() so it can be used for inference (#22916)
Fix schema() for DatasetPipelines (#23032)
Fix equalized split when num_splits == num_blocks (#23191)

💫 Enhancements

Optimize Parquet metadata serialization via batching (#21963)
Optimize metadata read/write for Ray Client (#21939)
Add sanity checks for memory utilization (#22642)

🏗 Architecture refactoring

Use threadpool to submit DatasetPipeline stages (#22912)

RLlib

🎉 New Features

New “AlphaStar” algorithm: A parallelized, multi-agent/multi-GPU learning algorithm, implementing league-based self-play. (#21356, #21649)
SlateQ algorithm has been re-tested, upgraded (multi-GPU capable, TensorFlow version), and bug-fixed (added to weekly learning tests). (#22389, #23276, #22544, #22543, #23168, #21827, #22738)
Bandit algorithms: Moved into agents folder as first-class citizens, TensorFlow-Version, unified w/ other agents’ APIs. (#22821, #22028, #22427, #22465, #21949, #21773, #21932, #22421)
ReplayBuffer API (in progress): Allow users to customize and configure their own replay buffers and use these inside custom or built-in algorithms. (#22114, #22390, #21808)
Datasets support for RLlib: Dataset Reader/Writer and documentation. (#21808, #22239, #21948)

🔨 Fixes

Fixed memory leak in SimpleReplayBuffer. (#22678)
Fixed Unity3D built-in examples: Action bounds from -inf/inf to -1.0/1.0. (#22247)
Various bug fixes. (#22350, #22245, #22171, #21697, #21855, #22076, #22590, #22587, #22657, #22428, #23063, #22619, #22731, #22534, #22074, #22078, #22641, #22684, #22398, #21685)

🏗 Architecture refactoring

A3C: Moved into new training_iteration API (from execution_plan API). Led to a ~2.7x performance increase on an Atari + CNN + LSTM benchmark. (#22126, #22316)
Make multiagent->policies_to_train more flexible via callable option (alternative to providing a list of policy IDs). (#20735)

💫Enhancements:

Env pre-checking module now active by default. (#22191)
Callbacks: Added on_sub_environment_created and on_trainer_init callback options. (#21893, #22493)
RecSim environment wrappers: Ability to use Google’s RecSim for recommender systems more easily w/ RLlib algorithms (3 RLlib-ready example environments). (#22028, #21773, #22211)
MARWIL loss function enhancement (exploratory term for stddev). (#21493)

📖Documentation:

Docs enhancements: Setup-dev instructions; Ray datasets integration. (#22239)
Other doc enhancements and fixes. (#23160, #23226, #22496, #22489, #22380)

Ray Workflow

🎉 New Features:

Support skip checkpointing.

🔨 Fixes:

Fix an issue where the event loop is not set.

Tune

🎉 New Features:

Expose new checkpoint interface to users (#22741)

💫Enhancemen...

Contributors

ericl, pcmoritz, and 84 other contributors

09 Mar 01:45

mwtian

ray-1.11.0

fec30a2

Ray-1.11.0

Highlights

🎉 Ray no longer starts Redis by default. Cluster metadata previously stored in Redis is stored in the GCS now.

Ray Autoscaler

🎉 New Features

AWS Cloudwatch dashboard support #20266

💫 Enhancements

KubeRay autoscaler prototype #21086

🔨 Fixes

Fix ray.autoscaler.sdk import issue #21795

Ray Core

🎉 New Features

Set actor died error message in ActorDiedError #20903
Event stats are enabled by default #21515

🔨 Fixes

Better support for nested tasks
Fixed a performance issue on 16GB Macs by limiting the plasma store size to 2GB #21224
Fix SchedulingClassInfo.running_tasks memory leak #21535
Round robin during spread scheduling #19968

🏗 Architecture refactoring

Refactor scheduler resource reporting public APIs #21732
Refactor ObjectManager wait logic to WaitManager #21369

Ray Data Processing

🎉 New Features

More powerful to_torch() API, providing more control over the GPU batch format. (#21117)

🔨 Fixes

Fix simple Dataset sort generating only 1 non-empty block. (#21588)
Improve error handling across sorting, groupbys, and aggregations. (#21610, #21627)
Fix boolean tensor column representation and slicing. (#22358)

RLlib

🎉 New Features

Better utils for flattening complex inputs and enable prev-actions for LSTM/attention for complex action spaces. (#21330)
MultiAgentEnv pre-checker (#21476)
Base env pre-checker. (#21569)

🔨 Fixes

Better defaults for QMix (#21332)
Fix contrib/MADDPG + pettingzoo coop-pong-v4. (#21452)
Fix action unsquashing causing inf/NaN actions for unbounded action spaces. (#21110)
Ignore PPO KL-loss term completely if kl-coeff == 0.0 to avoid NaN values (#21456)
Fix unsquash_action and clip_action (when None) causing wrong actions to be computed by Trainer.compute_single_action. (#21553)
Conv2d default filter tests and add default setting for 96x96 image obs space. (#21560)
Bring back and fix offline RL (BC & MARWIL) learning tests. (#21574, #21643)
SimpleQ should not use a prioritized replay buffer. (#21665)
Fix video recorder env wrapper. Added test case. (#21670)
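
The PPO KL-loss fix above guards against a floating-point pitfall: multiplying a zero coefficient by an infinite KL value gives NaN rather than zero, so the term has to be skipped entirely instead of being multiplied out. A minimal sketch with plain Python floats follows; it is not RLlib's loss code, and ppo_total_loss and the sample values are hypothetical.

```python
import math


def ppo_total_loss(policy_loss: float, kl: float, kl_coeff: float) -> float:
    """Add the KL penalty only when its coefficient is non-zero.

    Illustrative only: `ppo_total_loss` is a hypothetical helper, not an RLlib API.
    """
    if kl_coeff == 0.0:
        # Skip the term completely; 0.0 * inf would otherwise produce NaN.
        return policy_loss
    return policy_loss + kl_coeff * kl


# Naive version: the KL term is multiplied by zero but still poisons the sum.
naive = 1.5 + 0.0 * float("inf")
# Guarded version: the KL term is never evaluated into the loss.
guarded = ppo_total_loss(policy_loss=1.5, kl=float("inf"), kl_coeff=0.0)
```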

🏗 Architecture refactoring

Decentralized multi-agent learning (#21421)
Preparatory PR for multi-agent multi-GPU learner (alpha-star style) (#21652)

Ray Workflow

🔨 Fixes

Fixed workflow recovery issue due to a bug of dynamic output #21571

Tune

🎉 New Features

It is now possible to load all evaluated points from an experiment into a Searcher (#21506)
Add CometLoggerCallback (#20766)

💫 Enhancements

Only sync the checkpoint folder instead of the entire trial folder for cloud checkpoint. (#21658)
Add test for heterogeneous resource request deadlocks (#21397)
Remove unused return_or_clean_cached_pg (#21403)
Remove TrialExecutor.resume_trial (#21225)
Leave only one canonical way of stopping a trial (#21021)

🔨 Fixes

Replace deprecated running_sanity_check with sanity_checking in PTL integration (#21831)
Fix loading an ExperimentAnalysis object without a registered Trainable (#21475)
Fix stale node detection bug (#21516)
Fixes to allow tune/tests/test_commands.py to run on Windows (#21342)
Deflake PBT tests (#21366)
Fix dtype coercion in tune.choice (#21270)

📖 Documentation

Fix typo in schedulers.rst (#21777)

Train

🎉 New Features

Add PrintCallback (#21261)
Add MLflowLoggerCallback (#20802)

💫 Enhancements

Refactor Callback implementation (#21468, #21357, #21262)

🔨 Fixes

Fix Dataloader (#21467)

📖 Documentation

Documentation and example fixes (#21761, #21689, #21464)

Serve

🎉 New Features

Check out our revamped end-to-end tutorial that walks through the deployment journey! (#20765)

🔨 Fixes

Warn when serve.start() with different options (#21562)
Detect http.disconnect and cancel requests properly (#21438)
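
The general pattern behind disconnect detection can be sketched with plain asyncio. This is not Serve's implementation; it only assumes the ASGI convention that a client disconnect arrives as a message whose type is `http.disconnect`. The helper names `_wait_for_disconnect` and `run_unless_disconnected` are hypothetical.

```python
import asyncio


async def _wait_for_disconnect(receive):
    """Consume ASGI messages until the client disconnects."""
    while True:
        message = await receive()
        if message["type"] == "http.disconnect":
            return


async def run_unless_disconnected(handler_coro, receive):
    """Run handler_coro, cancelling it if http.disconnect arrives first."""
    handler = asyncio.ensure_future(handler_coro)
    watcher = asyncio.ensure_future(_wait_for_disconnect(receive))
    done, _pending = await asyncio.wait(
        {handler, watcher}, return_when=asyncio.FIRST_COMPLETED
    )
    if handler in done:
        # The request finished normally; stop watching for disconnects.
        watcher.cancel()
        return handler.result()
    # The client went away first; stop doing work on its behalf.
    handler.cancel()
    return None
```

Racing the handler against a disconnect watcher means a request whose client has gone away stops consuming resources instead of running to completion.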

Thanks
Many thanks to all those who contributed to this release!
@isaac-vidas, @wuisawesome, @stephanie-wang, @jon-chuang, @xwjiang2010, @jjyao, @MissiontoMars, @qbphilip, @yaoyuan97, @gjoliver, @Yard1, @rkooo567, @talesa, @czgdp1807, @DN6, @sven1977, @kfstorm, @krfricke, @simon-mo, @hauntsaninja, @pcmoritz, @JamieSlome, @chaokunyang, @jovany-wang, @sidward14, @DmitriGekhtman, @ericl, @mwtian, @jwyyy, @clarkzinzow, @hckuo, @vakker, @HuangLED, @iycheng, @edoakes, @shrekris-anyscale, @robertnishihara, @avnishn, @mickelliu, @ndrwnaguib, @ijrsvt, @Zyiqin-Miranda, @bveeramani, @SongGuyang, @n30111, @WangTaoTheTonic, @suquark, @richardliaw, @qicosmos, @scv119, @architkulkarni, @lixin-wei, @Catch-Bull, @acxz, @benblack769, @clay4444, @amogkam, @marin-ma, @maxpumperla, @jiaodong, @mattip, @isra17, @raulchen, @wilsonwang371, @carlogrisetti, @ashione, @matthewdeng

Contributors

ericl, pcmoritz, and 65 other contributors

04 Feb 19:23

architkulkarni

ray-1.10.0

5ea5653

Ray-1.10.0

Highlights

🎉 Ray Windows support is now in beta – a significant fraction of the Ray test suite now passes on Windows. We are eager to learn about your experience with Ray 1.10 on Windows; please file any issues you encounter at https://github.com/ray-project/ray/issues. In upcoming releases we will focus on making the Ray Serve and Runtime Environment tests pass on Windows and on general polish.

Ray Autoscaler

💫Enhancements:

Add autoscaler update time to prometheus metrics (#20831)
Fewer non terminated nodes calls in autoscaler update (#20359, #20623)

🔨 Fixes:

GCP TPU autoscaling fix (#20311)
Scale-down stability fix (#21204)
Report node launch failure in driver logs (#20814)

Ray Client

💫Enhancements

Client task options are encoded with pickle instead of json (#20930)

Ray Core

🎉 New Features:

runtime_env’s pip field now installs pip packages in your existing environment instead of installing them in a new isolated environment. (#20341)

🔨 Fixes:

Fix bug where specifying a per-job runtime_env conda/pip field with a local requirements file did not work when using Ray Client on a remote cluster (#20855)
Security fixes for log4j2 – the log4j2 version has been bumped to 2.17.1 (#21373)

💫Enhancements:

Allow runtime_env working_dir and py_modules to be pathlib.Path type (#20853, #20810)
Add environment variable to skip local runtime_env garbage collection (#21163)
Change runtime_env error log to debug log (#20875)
Improved reference counting for runtime_env resources (#20789)
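
A minimal sketch of what a runtime_env using pathlib.Path values might look like follows. The directory names and the package listed under `pip` are placeholders, not taken from the release notes. The dictionary would typically be passed as `ray.init(runtime_env=runtime_env)`; that call is left as a comment so the snippet runs without a Ray installation.

```python
from pathlib import Path

# Hypothetical project layout; replace with your own paths.
runtime_env = {
    "working_dir": Path("./my_project"),       # pathlib.Path instead of str
    "py_modules": [Path("./libs/my_module")],  # list entries may be Paths too
    "pip": ["requests"],                       # placeholder package name
}

# With Ray installed, this would be used roughly as:
# import ray
# ray.init(runtime_env=runtime_env)
```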

🏗 Architecture refactoring:

Refactor runtime_env to use protobuf for multi-language support (#19511)

📖Documentation:

Add more comprehensive runtime_env documentation (#20222, #21131, #20352)

Ray Data Processing

🎉 New Features:

Added stats framework for debugging Datasets performance (#20867, #21070)
[Dask-on-Ray] New config helper for enabling the Dask-on-Ray scheduler (#21114)

💫Enhancements:

Reduce memory usage when converting to a Pandas DataFrame (#20921)

🔨 Fixes:

Fix slow block evaluation when splitting (#20693)
Fix boundary sampling concatenation on non-uniform blocks (#20784)
Fix boolean tensor column slicing (#20905)

🏗 Architecture refactoring:

Refactor table block structure to support more tabular block formats (#20721)

RLlib

🎉 New Features:

Support for RE3 exploration algorithm (for tf only). (#19551)
Environment pre-checks, better failure behavior and enhanced environment API. (#20481, #20832, #20868, #20785, #21027, #20811)

🏗 Architecture refactoring:

Evaluation: Support evaluation setting that makes sure train doesn't ever have to wait for eval to finish (b/c of long episodes). (#20757); Always attach latest eval metrics. (#21011)
Soft-deprecate build_trainer() utility function in favor of sub-classing Trainer directly (and overriding some of its methods). (#20635, #20636, #20633, #20424, #20570, #20571, #20639, #20725)
Experimental no-flatten option for actions/prev-actions. (#20918)
Use SampleBatch instead of an input dict whenever possible. (#20746)
Switch off Preprocessors by default for PGTrainer (experimental). (#21008)
Toward a Replay Buffer API (cleanups; docstrings; renames; move into rllib/execution/buffers dir) (#20552)

📖Documentation:

Overhaul of auto-API reference pages. (#19786, #20537, #20538, #20486, #20250)
README and RLlib landing page overhaul (#20249).
Added example containing code to compute an adapted (time-dependent) GAE used by the PPO algorithm (#20850).

🔨 Fixes:

Smaller fixes and enhancements: #20704, #20541, #20793, #20743.

Tune

🎉 New Features:

Introduce TrialCheckpoint class, making checkpoint down/upload easier (#20585)
Add random state to BasicVariantGenerator (#20926)
Multi-objective support for Optuna (#20489)
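
The value of a random state for variant generation is reproducibility: the same seed yields the same sequence of sampled configurations. The sketch below illustrates that idea with the standard library only. It does not use Tune's API, and `sample_configs` and the search space are hypothetical.

```python
import random


def sample_configs(search_space, num_samples, seed=None):
    """Draw num_samples configs from search_space using a seeded RNG (illustrative only)."""
    rng = random.Random(seed)  # a private RNG, so the global state is untouched
    return [
        {name: rng.choice(options) for name, options in search_space.items()}
        for _ in range(num_samples)
    ]


# Hypothetical search space.
space = {"lr": [0.001, 0.01, 0.1], "batch_size": [16, 32, 64]}
first_run = sample_configs(space, 5, seed=0)
second_run = sample_configs(space, 5, seed=0)
```

Because both calls use the same seed, `first_run` and `second_run` contain identical configurations.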

💫Enhancements:

Add set_max_concurrency to Searcher API (#20576)
Allow for tuples in _split_resolved_unresolved_values. (#20794)
Show the name of training func, instead of just ImplicitFunction. (#21029)
Enforce one future at a time for any given trial at any given time. (#20783)
Move on_no_available_trials to a subclass under runner (#20809)
Clean up code (#20555, #20464, #20403, #20653, #20796, #20916, #21067)
Start restricting TrialRunner/Executor interface exposures. (#20656)
TrialExecutor should not take in Runner interface. (#20655)

🔨Fixes:

Deflake test_tune_restore.py (#20776)
Fix best_trial_str for nested custom parameter columns (#21078)
Fix checkpointing error message on K8s (#20559)
Fix testResourceScheduler and testMultiStepRun. (#20872)
Fix tune cloud tests for function and rllib trainables (#20536)
Move _head_bundle_is_empty after conversion (#21039)
Elongate test_trial_scheduler_pbt timeout. (#21120)

Train

🔨Fixes:

Ray Train environment variables are automatically propagated and do not need to be manually set on every node (#20523)
Various minor fixes and improvements (#20952, #20893, #20603, #20487)

📖Documentation:

Update saving/loading checkpoint docs (#20973). Thanks @jwyyy!
Various minor doc updates (#20877, #20683)

Serve

💫Enhancements:

Add validation to Serve AutoscalingConfig class (#20779)
Add Serve metric for HTTP error codes (#21009)

🔨Fixes:

No longer create placement group for deployment with no resources (#20471)
Log errors in deployment initialization/configuration user code (#20620)

Jobs

🎉 New Features:

Logs can be streamed from job submission server with ray job logs command (#20976)
Add documentation for ray job submission (#20530)
Propagate custom headers field to JobSubmissionClient and apply to all requests (#20663)

🔨Fixes: