Code is not running correctly on my supercomputer: #1

Open
wangjiajiTHU opened this issue Feb 2, 2024 · 1 comment
@wangjiajiTHU

Code is not running correctly on my supercomputer:
(/sqfs/work/G15408/v60646/conda_env/nclaw) [v60646@squidhpc3 train]$ python invariant_full_meta-invariant_full_meta.py
env:
  blob:
    bsdf_pcd:
      type: diffuse
      reflectance:
        type: rgb
        value:
        - 0.92941176
        - 0.32941176
        - 0.23137255
    material:
      elasticity:
        cls: InvariantFullMetaElasticity
        layer_widths:
        - 64
        - 64
        norm: null
        nonlinearity: gelu
        no_bias: true
        normalize_input: true
        requires_grad: true
      plasticity:
        cls: InvariantFullMetaPlasticity
        layer_widths:
        - 64
        - 64
        norm: null
        alpha: 0.001
        nonlinearity: gelu
        no_bias: true
        normalize_input: true
        requires_grad: true
      name: jelly
      ckpt: null
    shape:
      type: cube
      name: dataset
      center:
      - 0.5
      - 0.5
      - 0.5
      size:
      - 0.5
      - 0.5
      - 0.5
      resolution: 10
      mode: uniform
      sort: null
    vel:
      random: false
      lin_vel:
      - 1.0
      - -1.5
      - -2.0
      ang_vel:
      - 4.0
      - 4.0
      - 4.0
    name: jelly
    rho: 1000.0
    span:
    - 0
    - 1000
    clip_bound: 0.5
render:
  spp: 32
  width: 512
  height: 512
  skip_frame: 25
  bound: 1.75
  mpm_mul: 6
  sph_version: cuda_ad_rgb
  pcd_version: cuda_ad_rgb
  has_sphere_emitter: true
  fps: 10
sim:
  quality: low
  num_steps: 1000
  gravity:
  - 0.0
  - -9.8
  - 0.0
  bc: freeslip
  num_grids: 20
  dt: 0.0005
  bound: 3
  eps: 1.0e-07
  skip_frame: 1
train:
  teacher:
    strategy: cosine
    start_lambda: 25
    end_lambda: 200
  num_epochs: 300
  batch_size: 128
  elasticity_lr: 1.0
  plasticity_lr: 0.1
  elasticity_wd: 0.0
  plasticity_wd: 0.0
  elasticity_grad_max_norm: 0.1
  plasticity_grad_max_norm: 0.1
name: jelly/train/invariant_full_meta-invariant_full_meta
seed: 0
cpu: 0
num_cpus: 128
gpu: 0
overwrite: false
resume: false

Warp 0.11.0 initialized:
   CUDA Toolkit: 11.5, Driver: 12.0
   Devices:
     "cpu"    | x86_64
     "cuda:0" | Quadro RTX 6000 (sm_75)
   Kernel cache: /sqfs/home/v60646/.cache/warp/0.11.0
target directory (/sqfs2/cmc/1/work/G15408/v60646/github/NCLaw/experiments/log/jelly/train/invariant_full_meta-invariant_full_meta) already exists, overwrite? [Y/r/n] y
overwriting directory (/sqfs2/cmc/1/work/G15408/v60646/github/NCLaw/experiments/log/jelly/train/invariant_full_meta-invariant_full_meta)
  0%| | 0/1000 [00:00<?, ?it/s]/sqfs/work/G15408/v60646/conda_env/nclaw/lib/python3.10/site-packages/warp/torch.py:159: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at /opt/conda/conda-bld/pytorch_1704987280714/work/build/aten/src/ATen/core/TensorBody.h:489.)
  if t.grad is None:
  0%| | 0/300 [00:04<?, ?it/s]
Error executing job with overrides: ['overwrite=False', 'resume=False', 'gpu=0', 'cpu=0', 'env=jelly', 'env/blob/material/elasticity=invariant_full_meta', 'env/blob/material/plasticity=invariant_full_meta', 'env.blob.material.elasticity.requires_grad=True', 'env.blob.material.plasticity.requires_grad=True', 'render=debug', 'sim=low', 'name=jelly/train/invariant_full_meta-invariant_full_meta']
Traceback (most recent call last):
  File "/sqfs2/cmc/1/work/G15408/v60646/github/NCLaw/experiments/train.py", line 131, in main
    loss.backward()
  File "/sqfs/work/G15408/v60646/conda_env/nclaw/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/sqfs/work/G15408/v60646/conda_env/nclaw/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/sqfs/work/G15408/v60646/conda_env/nclaw/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/sqfs2/cmc/1/work/G15408/v60646/github/NCLaw/nclaw/sim/interface.py", line 62, in backward
    model.backward(statics, state_curr, state_next, tape)
  File "/sqfs2/cmc/1/work/G15408/v60646/github/NCLaw/nclaw/sim/mpm.py", line 313, in backward
    tape.backward()
  File "/sqfs/work/G15408/v60646/conda_env/nclaw/lib/python3.10/site-packages/warp/tape.py", line 119, in backward
    adj_inputs.append(self.get_adjoint(a))
  File "/sqfs2/cmc/1/work/G15408/v60646/github/NCLaw/nclaw/warp/tape.py", line 24, in get_adjoint
    adj = wp.codegen.StructInstance(a.struct)
AttributeError: 'NewStructInstance' object has no attribute 'struct'

@PingchuanMa
Owner

Sorry for the late reply! Please try replacing the tape.py file with this file. From what I saw, the problem stems from an incompatibility introduced when warp was upgraded from 0.6.1 to the version you were using (0.11.0). Let me know if it helps.
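For anyone hitting the same error before swapping in the updated tape.py: the failure pattern is code assuming warp's old internal layout (an object exposing a `.struct` attribute) after a version upgrade renamed it. One general way to tolerate such renames is an attribute-guarded fallback. Below is only an illustrative toy sketch of that pattern — the class names mimic the traceback, but the classes and accessors here are hypothetical stand-ins, not warp's real API:

```python
class OldStructInstance:
    """Stand-in for the pre-0.11-style object that exposed `.struct`."""
    def __init__(self, struct):
        self.struct = struct


class NewStructInstance:
    """Stand-in for the post-upgrade object without a `.struct` attribute."""
    def __init__(self, cls):
        self._cls = cls  # hypothetical replacement attribute


def get_struct_compat(a):
    # Dispatch on the attribute that changed between versions instead of
    # assuming the old layout; unconditionally reading `a.struct` is what
    # raised the AttributeError in the traceback above.
    if hasattr(a, "struct"):
        return a.struct   # old layout
    return a._cls         # new layout (hypothetical accessor)
```

The same `hasattr` guard (or a check on `warp.__version__`) lets a single tape.py run against both warp versions instead of hard-pinning one.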
