RFC-0033-GDS-checkpointing #59
base: master
Conversation
Hi @antferdom! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged accordingly.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
@facebook-github-bot label commenting
- Storage->CPU->GPU_ASYNC
- Storage->PAGE_CACHE->CPU->GPU
- Storage->GPU_ASYNC
- Storage->GPU_BATCH
Would this be a BC-breaking change? E.g., would all calls to torch.load and torch.save need to specify this argument, or would it be automatically inferred?
I would try to leave the current torch.save & torch.load API surface intact. Therefore we could auto-infer it or default the input parameter to the common current scheme.
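A minimal sketch of what that could look like, assuming a hypothetical io_backend keyword and a hypothetical _gds_usable helper (neither exists in PyTorch today); existing call sites would keep working unchanged:

import torch

def save(obj, f, *, io_backend="auto"):
    # Hypothetical: "auto" would pick the GDS path only when the environment
    # supports it; otherwise it falls back to the existing serializer unchanged.
    if io_backend == "gds" or (io_backend == "auto" and _gds_usable(obj, f)):
        raise NotImplementedError("GDS fast path would go here")
    return torch.save(obj, f)

def _gds_usable(obj, f) -> bool:
    # Placeholder inference; a real check would query cuFile/driver support
    # and whether the target filesystem allows direct I/O.
    return False

# Existing call sites keep working without new arguments:
save({"weight": torch.randn(4, 4)}, "ckpt.pt")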
Thanks @antferdom --- we're starting to look into this now and are currently working on verifying whether the performance looks promising compared to the current native PyTorch implementation(s).
@eqy if possible I would like to join the formal evaluations and experimentation you and your team are considering performing. We are excited about co-developing this feature, or even just rigorously studying its viability and implications.
Any update, @eqy?
@Aidyn-A and @mikaylagawarecki are currently working on it.
@eqy Thanks for your fast response, I appreciate it. Would it be possible for us to join the ongoing research into the feasibility of this? We truly want to push the development and integration of this into PyTorch core if it matches the performance expectations. @mikaylagawarecki
Hey @antferdom, thank you for your enthusiasm in pushing this forward! Let me try to give a summary of where we are so far.

From preliminary discussions, my understanding is that there are 3 broad classes of cuFile APIs for GPUDirect Storage: (1) synchronous, (2) asynchronous (stream-based), and (3) batch. If the performance properties looked reasonable, we had plans to upstream support into PyTorch.

As a first step, we were trying to benchmark (1) with NVMe in non-compatibility mode. @Aidyn-A created a PyTorch extension for synchronous saving/loading of tensors with benchmarking utilities here, and I have a very preliminary prototype of upstreaming it. The install process for GPUDirect Storage in non-compatibility mode on the user end is tricky; we have not successfully gotten it to run in non-compatibility mode with NVMe yet (the latest issue I personally ran into was with the install itself).

I am curious -- do you have benchmark numbers for the performance of GPUDirect Storage in non-compatibility mode? If so, would you be willing to share these plus the hardware configuration/filesystem type you used for the benchmarks? 😄
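For reference, a minimal sketch of what class (1), the synchronous cuFile path, looks like when exercised through kvikio (the binding used in the RFC experiments); this is not the PyTorch extension linked above, and the file name and tensor shape are arbitrary:

import cupy as cp
import kvikio
import torch

x = torch.randn(1024, 1024, device="cuda")
x_cu = cp.asarray(x)                      # zero-copy CuPy view of the CUDA tensor

with kvikio.CuFile("tensor.bin", "w") as f:
    f.write(x_cu)                         # blocking write, GPU memory -> storage

y = torch.empty_like(x)
with kvikio.CuFile("tensor.bin", "r") as f:
    f.read(cp.asarray(y))                 # blocking read, storage -> GPU memory

assert torch.equal(x, y)

kvikio falls back to POSIX I/O automatically when GDS (non-compatibility mode) is not available, so the same code runs in either mode.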
Hi @mikaylagawarecki! Absolutely, I will gladly share my benchmark configuration and results with you. I am dealing with some issues while trying to reproduce my original environment with the new Linux kernel version I'm using (6.5.0-14-generic). The following illustrates my still incomplete environment configuration for running without compatibility mode enabled:

============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Unsupported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 0
execution.max_io_queue_depth : 128
execution.parallel_io : false
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 0
=========
GPU INFO:
=========
GPU index 0 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
GPU index 1 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled

I'm going to test the current implementation prototypes that you referenced, since my initial attempts and their results may no longer hold in comparison with the numbers these implementations produce. As highlighted in the RFC, all my experiments made use of the cuFile API via rapidsai/kvikio.
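As a quick programmatic cross-check of the dump above, a sketch using kvikio's DriverProperties (the same object the benchmark below relies on) plus the nvidia-fs procfs entry shown in the driver statistics later in this thread (path assumed from that dump):

from pathlib import Path

import kvikio

def gds_status() -> dict:
    # Is the GDS driver visible to cuFile, and is the nvidia-fs kernel module
    # exposing its statistics?
    props = kvikio.DriverProperties()
    status = {
        "gds_available": props.is_gds_available,
        "nvidia_fs_loaded": Path("/proc/driver/nvidia-fs/stats").exists(),
    }
    if props.is_gds_available:
        status["nvfs_version"] = f"{props.major_version}.{props.minor_version}"
    return status

print(gds_status())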
Benchmarking GPUDirect in Non-Compatibility Mode

System Information

Distro Version:
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04

Kernel Version: 5.15.0-101-generic

Hardware Configuration
IOMMU
Status:
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-101-generic root=/dev/mapper/vgroot-lvroot ro processor.max_cstate=1 amd_iommu=off
[ 0.403876] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-101-generic root=/dev/mapper/vgroot-lvroot ro processor.max_cstate=1 amd_iommu=off
[ 1.159437] iommu: Default domain type: Translated
[ 1.159437] iommu: DMA domain TLB invalidation policy: lazy mode

MLNX_OFED Requirements and Installation

Reference: 14. Troubleshooting and FAQ for NVMe and NVMeOF Support

apt install nvidia-gds-12-1
apt install nvidia-fs=2.17.3-1 nvidia-fs-dkms=2.17.3-1 # to downgrade
modprobe nvidia-fs

./mlnxofedinstall --with-nvmf --with-nfsrdma --enable-gds --add-kernel-support --dkms
apt install --reinstall `dpkg -l | grep 545 | awk '{print $2}'`
modprobe nvidia-peermem
modprobe nvme-rdma
modprobe nvmet-rdma

Displaying GDS NVIDIA FS Driver Statistics
GDS Version: 1.7.1.12
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.17.3)
Mellanox PeerDirect Supported: True
IO stats: Disabled, peer IO stats: Disabled
Logging level: info
Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads : err=0 io_state_err=0
Sparse Reads : n=0 io=0 holes=0 pages=0
Writes : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap : n=0 ok=0 err=0 munmap=0
Bar1-map : n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0
Error : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops : Read=0 Write=0 BatchIO=0

Disk & Filesystem Information
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nvme0n1 259:0 0 1.8T 0 disk
├─nvme0n1p1 259:1 0 1M 0 part
└─nvme0n1p2 259:2 0 1.8T 0 part
└─vgroot-lvroot 253:0 0 1.8T 0 lvm /
*-namespace:0
description: NVMe disk
physical id: 0
logical name: hwmon0
*-namespace:1
description: NVMe disk
physical id: 2
logical name: /dev/ng0n1
*-namespace:2
description: NVMe disk
physical id: 1
bus info: nvme@0:1
logical name: /dev/nvme0n1
size: 1863GiB (2TB)
capabilities: gpt-1.00 partitioned partitioned:gpt
configuration: guid=6546f52c-62a4-ed48-b04a-2551ce27034c logicalsectorsize=512 sectorsize=512 wwid=eui.e8238fa6bf530001001b448b4cacce2b
WD Red SN700 2000GB
Verifying a Successful GDS Installation

To verify that the GDS installation was successful, run the gdscheck tool:
warn: error opening log file: Permission denied, logging will be disabled
GDS release version: 1.6.1.9
nvidia_fs version: 2.17
libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Supported
NVMeOF : Supported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Enabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 0
execution.max_io_queue_depth : 128
execution.parallel_io : false
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 0
=========
GPU INFO:
=========
GPU index 0 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
GPU index 1 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
IOMMU: disabled
Platform verification succeeded

Experiment 0: Synthetic 10 GB Torch tensor cuFile save/load

Assert Initial Conditions: NVFS status and properties

import kvikio
from kvikio import CuFile
import kvikio.defaults
from kvikio.defaults import set_compat_mode, compat_mode, compat_mode_reset
def test_compat_mode() -> None:
    before = compat_mode()
    print(f"Driver compat mode: {before}")
    with set_compat_mode(True):
        assert compat_mode()
        compat_mode_reset(False)
        assert not compat_mode()
    assert before == compat_mode()
# test_compat_mode()
print(f"Compability mode: {compat_mode()}")
handle = kvikio.libkvikio.DriverProperties()
props = kvikio.DriverProperties()
print(f"GDS Driver availability: {props.is_gds_available}")
if props.is_gds_available: print(f"v{props.major_version}.{props.minor_version}")

Compatibility mode: False
GDS Driver availability: True
v2.17

source code: cutorch.py

# %%
import kvikio
import kvikio.defaults
from kvikio.defaults import (
get_num_threads,
set_num_threads,
)
import cupy as cp
import torch
import logging
import time
import os
from pathlib import Path
logging.basicConfig(level=logging.INFO)
logger: logging.Logger = logging.getLogger(__name__)
TENSOR_DIMS = (50_000, 50_000)
TENSOR_FN = Path("consolidated.00.pth")
NUM_THREADS = 32
before = get_num_threads()
print(f"Tensor dimensions: {TENSOR_DIMS}")
print(f"Tensor fn: {TENSOR_FN}")
print(f"kvikio number of threads: {before}")
print(f"GPU number of threads: {NUM_THREADS}") Tensor dimensions: (50000, 50000)
Tensor fn: consolidated.00.pth
kvikio number of threads: 1
GPU number of threads: 32

# %%
# cuFile serialization
st = time.perf_counter_ns()
x = torch.empty(*TENSOR_DIMS, device="cuda")
x_cu = cp.asarray(x)
# Write whole array to file
with kvikio.defaults.set_num_threads(NUM_THREADS):
    assert get_num_threads() == NUM_THREADS
    f = kvikio.CuFile(TENSOR_FN, "w")
    f.write(x_cu)
    f.close()
et = time.perf_counter_ns() - st
print(f"cuFile serilization elapsed time: {et*1e-9:.2f} s")
del x, x_cu
torch.cuda.empty_cache()

cuFile serialization elapsed time: 3.72 s

# %%
# cuFile torch tensor deserialization
import cupy
# import cunumeric as num
tensor_size = os.path.getsize(TENSOR_FN)
print(f"Tensor size: {tensor_size / 1e09:.2f} GB")
x_cu = cp.asarray(torch.empty(*TENSOR_DIMS, device="cuda"))
# x_cu = cp.empty(shape=(50_000, 50_000))
st = time.perf_counter_ns()
with kvikio.defaults.set_num_threads(NUM_THREADS):
    assert get_num_threads() == NUM_THREADS
    st = time.perf_counter_ns()
    f = kvikio.CuFile(TENSOR_FN, "r")
    f.read(x_cu)
    x_cutorch = torch.as_tensor(x_cu, device="cuda")
print(f"Tensor loading time: {(time.perf_counter_ns() - st)*1e-9:.4f} s")
print(f"Device: {x_cutorch.device}")
# %%
del x_cutorch, x_cu
torch.cuda.empty_cache()

Tensor size: 10.00 GB
Tensor loading time: 3.4625 s
Device: cuda:0

Verify that system caches are not impacting the experiment measurements:

vmtouch consolidated.00.pth
Files: 1
Directories: 0
Resident Pages: 191/2441407 764K/9G 0.00782%
Elapsed: 0.060878 seconds
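As an additional cross-check (a standard-library sketch; vmtouch -e does the same from the shell), the file's pages can be explicitly evicted from the page cache before timing a cold read; the file name matches the benchmark above:

import os

def drop_page_cache(path: str) -> None:
    # Evict a file's cached pages so a subsequent read benchmark is not
    # served from the Linux page cache.
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # flush outstanding dirty pages for this file first
        # Advise the kernel to drop cached pages for the whole file (len=0).
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

drop_page_cache("consolidated.00.pth")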
DeepSpeed has an upcoming feature (version >= 0.15) focusing on NVMe technologies with GDS: DeepNVMe: Improving DL Applications through I/O Optimizations. @mikaylagawarecki