
Cannot serialise number: must not be NaN or Infinity #220

Open
thehowl opened this issue Mar 27, 2018 · 1 comment

thehowl commented Mar 27, 2018

Running torch-rnn, I get this from time to time when it saves checkpoints:

/home/howl/torch-cl/install/bin/luajit: ./util/utils.lua:50: Cannot serialise number: must not be NaN or Infinity
stack traceback:
	[C]: in function 'encode'
	./util/utils.lua:50: in function 'write_json'
	train.lua:234: in main chunk
	[C]: in function 'dofile'
	...l/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
	[C]: at 0x55832286e450
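(For reference, the message itself comes from the JSON encoder that write_json calls into — lua-cjson, judging by the wording — which refuses to serialise NaN and Infinity by default. A minimal sketch of the same failure, assuming the cjson rock is what's installed:)

-- Minimal sketch: lua-cjson rejects non-finite numbers by default.
local cjson = require 'cjson'

local nan = 0 / 0        -- NaN
local inf = math.huge    -- +Infinity

-- Both calls fail with "Cannot serialise number: must not be NaN or Infinity".
print(pcall(cjson.encode, { loss = nan }))
print(pcall(cjson.encode, { loss = inf }))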

I'm running torch-cl with the following:

th train.lua -input_h5 ../data.h5 -input_json ../data.json -gpu_backend opencl -init_from cv/checkpoint_74000.t7 -reset_iterations 0

I added the last two options because I had already hit the problem in previous runs.

The graphics card is an NVIDIA GeForce 620 OEM. I'm using OpenCL because getting CUDA working on my machine seems close to impossible, or at least very hard (it's set up somewhat like an NVIDIA Optimus laptop, but it's a Dell workstation; I can find out the exact model if needed).

Running on Debian GNU/Linux sid (unstable).

thehowl commented Mar 27, 2018

As it turns out, the issue seems to be caused by an Inf being added to the output (so, for some reason, the loss calculation hits a division by zero). When the loss history is encoded, the encoder runs into the Inf and throws the error. If anyone else hits this issue, here is the quick patch I applied in my local repo:

diff --git a/train.lua b/train.lua
index 52210ec..e11869b 100644
--- a/train.lua
+++ b/train.lua
@@ -185,7 +185,11 @@ for i = start_i + 1, num_iterations do
   -- Take a gradient step and maybe print
   -- Note that adam returns a singleton array of losses
   local _, loss = optim.adam(f, params, optim_config)
-  table.insert(train_loss_history, loss[1])
+  if loss[1] == math.huge or loss[1] == -math.huge or loss[1] ~= loss[1] then
+    print(string.format("Can't represent %f in JSON, so not adding to the training loss history", loss[1]))
+  else
+    table.insert(train_loss_history, loss[1])
+  end
   if opt.print_every > 0 and i % opt.print_every == 0 then
     local float_epoch = i / num_train + 1
     local msg = 'Epoch %.2f / %d, i = %d / %d, loss = %f'

Handles +Inf, -Inf and NaN.
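
If you'd rather keep the check out of the training loop, an equivalent (hypothetical, untested against this repo) approach is a small finiteness helper, which could also be used to sanitise an already-corrupted history right before it gets encoded:

-- Hypothetical helper: true only for finite numbers.
-- NaN is the only value that compares unequal to itself,
-- and +Inf/-Inf compare equal to +math.huge/-math.huge.
local function is_finite(x)
  return x == x and x ~= math.huge and x ~= -math.huge
end

-- Drop non-finite entries from an existing loss history before encoding.
local function sanitize_history(history)
  local out = {}
  for _, v in ipairs(history) do
    if is_finite(v) then table.insert(out, v) end
  end
  return out
end

-- usage sketch: pass sanitize_history(train_loss_history) to whatever
-- builds the table that write_json serialises.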
