
Cannot serialise number: must not be NaN or Infinity #220

Open
thehowl opened this issue Mar 27, 2018 · 1 comment

thehowl commented Mar 27, 2018

Running torch-rnn, I get this from time to time when it saves checkpoints:

/home/howl/torch-cl/install/bin/luajit: ./util/utils.lua:50: Cannot serialise number: must not be NaN or Infinity
stack traceback:
	[C]: in function 'encode'
	./util/utils.lua:50: in function 'write_json'
	train.lua:234: in main chunk
	[C]: in function 'dofile'
	...l/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
	[C]: at 0x55832286e450
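(For reference, the message itself comes from the JSON encoder that write_json calls into — lua-cjson, judging by the wording — which refuses to serialise NaN and Infinity by default. A minimal sketch of the same failure, assuming the cjson rock is what's installed:)

-- Minimal sketch: lua-cjson rejects non-finite numbers by default.
local cjson = require 'cjson'

local nan = 0 / 0        -- NaN
local inf = math.huge    -- +Infinity

-- Both calls fail with "Cannot serialise number: must not be NaN or Infinity".
print(pcall(cjson.encode, { loss = nan }))
print(pcall(cjson.encode, { loss = inf }))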

I'm running torch-cl with the following:

th train.lua -input_h5 ../data.h5 -input_json ../data.json -gpu_backend opencl -init_from cv/checkpoint_74000.t7 -reset_iterations 0

I added the last two options because I had already hit the problem in previous runs.

The graphics card is an NVIDIA GeForce 620 OEM. I'm using OpenCL because getting CUDA working on my machine seems close to impossible, or at least very hard (it's set up somewhat like an NVIDIA Optimus laptop, but it's a Dell workstation; I can find out the exact model if needed).

Running on Debian GNU/Linux sid (unstable).

thehowl commented Mar 27, 2018

As it turns out, the issue seems to be caused by an Inf being added to the output (so, for some reason, the loss calculation hits a division by zero). When the loss history is encoded, the encoder runs into the Inf and throws the error. If anyone else hits this issue, here is the quick patch I applied in my local repo:

diff --git a/train.lua b/train.lua
index 52210ec..e11869b 100644
--- a/train.lua
+++ b/train.lua
@@ -185,7 +185,11 @@ for i = start_i + 1, num_iterations do
   -- Take a gradient step and maybe print
   -- Note that adam returns a singleton array of losses
   local _, loss = optim.adam(f, params, optim_config)
-  table.insert(train_loss_history, loss[1])
+  if loss[1] == math.huge or loss[1] == -math.huge or loss[1] ~= loss[1] then
+    print(string.format("Can't represent %f in JSON, so not adding to the training loss history", loss[1]))
+  else
+    table.insert(train_loss_history, loss[1])
+  end
   if opt.print_every > 0 and i % opt.print_every == 0 then
     local float_epoch = i / num_train + 1
     local msg = 'Epoch %.2f / %d, i = %d / %d, loss = %f'

Handles +Inf, -Inf and NaN.
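
If you'd rather keep the check out of the training loop, an equivalent (hypothetical, untested against this repo) approach is a small finiteness helper, which could also be used to sanitise an already-corrupted history right before it gets encoded:

-- Hypothetical helper: true only for finite numbers.
-- NaN is the only value that compares unequal to itself,
-- and +Inf/-Inf compare equal to +math.huge/-math.huge.
local function is_finite(x)
  return x == x and x ~= math.huge and x ~= -math.huge
end

-- Drop non-finite entries from an existing loss history before encoding.
local function sanitize_history(history)
  local out = {}
  for _, v in ipairs(history) do
    if is_finite(v) then table.insert(out, v) end
  end
  return out
end

-- usage sketch: pass sanitize_history(train_loss_history) to whatever
-- builds the table that write_json serialises.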
