Code is not running correctly on my supercomputer: #1

Open
wangjiajiTHU opened this issue Feb 2, 2024 · 1 comment
@wangjiajiTHU

Code is not running correctly on my supercomputer:
(/sqfs/work/G15408/v60646/conda_env/nclaw) [v60646@squidhpc3 train]$ python invariant_full_meta-invariant_full_meta.py
env:
  blob:
    bsdf_pcd:
      type: diffuse
      reflectance:
        type: rgb
        value:
        - 0.92941176
        - 0.32941176
        - 0.23137255
    material:
      elasticity:
        cls: InvariantFullMetaElasticity
        layer_widths:
        - 64
        - 64
        norm: null
        nonlinearity: gelu
        no_bias: true
        normalize_input: true
        requires_grad: true
      plasticity:
        cls: InvariantFullMetaPlasticity
        layer_widths:
        - 64
        - 64
        norm: null
        alpha: 0.001
        nonlinearity: gelu
        no_bias: true
        normalize_input: true
        requires_grad: true
      name: jelly
      ckpt: null
    shape:
      type: cube
      name: dataset
      center:
      - 0.5
      - 0.5
      - 0.5
      size:
      - 0.5
      - 0.5
      - 0.5
      resolution: 10
      mode: uniform
      sort: null
    vel:
      random: false
      lin_vel:
      - 1.0
      - -1.5
      - -2.0
      ang_vel:
      - 4.0
      - 4.0
      - 4.0
    name: jelly
    rho: 1000.0
    span:
    - 0
    - 1000
    clip_bound: 0.5
render:
  spp: 32
  width: 512
  height: 512
  skip_frame: 25
  bound: 1.75
  mpm_mul: 6
  sph_version: cuda_ad_rgb
  pcd_version: cuda_ad_rgb
  has_sphere_emitter: true
  fps: 10
sim:
  quality: low
  num_steps: 1000
  gravity:
  - 0.0
  - -9.8
  - 0.0
  bc: freeslip
  num_grids: 20
  dt: 0.0005
  bound: 3
  eps: 1.0e-07
  skip_frame: 1
train:
  teacher:
    strategy: cosine
    start_lambda: 25
    end_lambda: 200
  num_epochs: 300
  batch_size: 128
  elasticity_lr: 1.0
  plasticity_lr: 0.1
  elasticity_wd: 0.0
  plasticity_wd: 0.0
  elasticity_grad_max_norm: 0.1
  plasticity_grad_max_norm: 0.1
name: jelly/train/invariant_full_meta-invariant_full_meta
seed: 0
cpu: 0
num_cpus: 128
gpu: 0
overwrite: false
resume: false

Warp 0.11.0 initialized:
   CUDA Toolkit: 11.5, Driver: 12.0
   Devices:
     "cpu"    | x86_64
     "cuda:0" | Quadro RTX 6000 (sm_75)
   Kernel cache: /sqfs/home/v60646/.cache/warp/0.11.0
target directory (/sqfs2/cmc/1/work/G15408/v60646/github/NCLaw/experiments/log/jelly/train/invariant_full_meta-invariant_full_meta) already exists, overwrite? [Y/r/n] y
overwriting directory (/sqfs2/cmc/1/work/G15408/v60646/github/NCLaw/experiments/log/jelly/train/invariant_full_meta-invariant_full_meta)
  0%| | 0/1000 [00:00<?, ?it/s]/sqfs/work/G15408/v60646/conda_env/nclaw/lib/python3.10/site-packages/warp/torch.py:159: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at /opt/conda/conda-bld/pytorch_1704987280714/work/build/aten/src/ATen/core/TensorBody.h:489.)
  if t.grad is None:
  0%| | 0/300 [00:04<?, ?it/s]
Error executing job with overrides: ['overwrite=False', 'resume=False', 'gpu=0', 'cpu=0', 'env=jelly', 'env/blob/material/elasticity=invariant_full_meta', 'env/blob/material/plasticity=invariant_full_meta', 'env.blob.material.elasticity.requires_grad=True', 'env.blob.material.plasticity.requires_grad=True', 'render=debug', 'sim=low', 'name=jelly/train/invariant_full_meta-invariant_full_meta']
Traceback (most recent call last):
  File "/sqfs2/cmc/1/work/G15408/v60646/github/NCLaw/experiments/train.py", line 131, in main
    loss.backward()
  File "/sqfs/work/G15408/v60646/conda_env/nclaw/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/sqfs/work/G15408/v60646/conda_env/nclaw/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/sqfs/work/G15408/v60646/conda_env/nclaw/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/sqfs2/cmc/1/work/G15408/v60646/github/NCLaw/nclaw/sim/interface.py", line 62, in backward
    model.backward(statics, state_curr, state_next, tape)
  File "/sqfs2/cmc/1/work/G15408/v60646/github/NCLaw/nclaw/sim/mpm.py", line 313, in backward
    tape.backward()
  File "/sqfs/work/G15408/v60646/conda_env/nclaw/lib/python3.10/site-packages/warp/tape.py", line 119, in backward
    adj_inputs.append(self.get_adjoint(a))
  File "/sqfs2/cmc/1/work/G15408/v60646/github/NCLaw/nclaw/warp/tape.py", line 24, in get_adjoint
    adj = wp.codegen.StructInstance(a.struct)
AttributeError: 'NewStructInstance' object has no attribute 'struct'

@PingchuanMa
Owner

Sorry for the late reply! Please try replacing the tape.py file with this file. From what I saw, the problem stems from an incompatibility introduced when warp was upgraded from 0.6.1 to the version you were using (0.11.0). Let me know if it helps.
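For anyone hitting the same error before swapping in the updated tape.py: the failure pattern is code assuming warp's old internal layout (an object exposing a `.struct` attribute) after a version upgrade renamed it. One general way to tolerate such renames is an attribute-guarded fallback. Below is only an illustrative toy sketch of that pattern — the class names mimic the traceback, but the classes and accessors here are hypothetical stand-ins, not warp's real API:

```python
class OldStructInstance:
    """Stand-in for the pre-0.11-style object that exposed `.struct`."""
    def __init__(self, struct):
        self.struct = struct


class NewStructInstance:
    """Stand-in for the post-upgrade object without a `.struct` attribute."""
    def __init__(self, cls):
        self._cls = cls  # hypothetical replacement attribute


def get_struct_compat(a):
    # Dispatch on the attribute that changed between versions instead of
    # assuming the old layout; unconditionally reading `a.struct` is what
    # raised the AttributeError in the traceback above.
    if hasattr(a, "struct"):
        return a.struct   # old layout
    return a._cls         # new layout (hypothetical accessor)
```

The same `hasattr` guard (or a check on `warp.__version__`) lets a single tape.py run against both warp versions instead of hard-pinning one.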
