Running torch-rnn, I get this error from time to time when saving checkpoints:
/home/howl/torch-cl/install/bin/luajit: ./util/utils.lua:50: Cannot serialise number: must not be NaN or Infinity
stack traceback:
[C]: in function 'encode'
./util/utils.lua:50: in function 'write_json'
train.lua:234: in main chunk
[C]: in function 'dofile'
...l/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x55832286e450
The last two options were added because I had the problem already in previous runs.
Graphics card is an NVIDIA GeForce 620 OEM. I'm using OpenCL because getting CUDA to run on my machine seems close to impossible, or at least very hard (it's sort of like an NVIDIA Optimus laptop, but it's a Dell workstation; I can find out the model if needed).
Running on Debian GNU/Linux sid (unstable).
As it turns out, this issue seems to be caused by an Inf value being appended to the training loss history (so for some reason the loss computation hits a division by zero). When the loss history is JSON-encoded at checkpoint time, the encoder runs into the Inf and throws the error above. If anyone else hits this issue, I quickly patched it in my local repo:
diff --git a/train.lua b/train.lua
index 52210ec..e11869b 100644
--- a/train.lua
+++ b/train.lua
@@ -185,7 +185,11 @@ for i = start_i + 1, num_iterations do
   -- Take a gradient step and maybe print
   -- Note that adam returns a singleton array of losses
   local _, loss = optim.adam(f, params, optim_config)
-  table.insert(train_loss_history, loss[1])
+  if loss[1] == math.huge or loss[1] == -math.huge or loss[1] ~= loss[1] then
+    print(string.format("Can't represent %f in JSON, so not adding to the training loss history", loss[1]))
+  else
+    table.insert(train_loss_history, loss[1])
+  end
   if opt.print_every > 0 and i % opt.print_every == 0 then
     local float_epoch = i / num_train + 1
     local msg = 'Epoch %.2f / %d, i = %d / %d, loss = %f'
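For anyone wanting to understand the guard in the patch: the `loss[1] ~= loss[1]` part works because NaN is the only value that compares unequal to itself, and strict JSON encoders (the error message here matches lua-cjson's) simply have no representation for NaN or Infinity. Here is a minimal Python sketch of both facts; `json_safe` is just a hypothetical name for the translated check, not anything in torch-rnn:

```python
import json
import math

def json_safe(x):
    # Same check as the Lua patch, translated to Python:
    # NaN is the only float that is unequal to itself,
    # and math.inf plays the role of Lua's math.huge.
    return x == x and x != math.inf and x != -math.inf

print(json_safe(1.5))       # True
print(json_safe(math.inf))  # False
print(json_safe(math.nan))  # False

# A strict encoder rejects non-finite floats outright, which is
# what write_json's encode step is doing at checkpoint time:
try:
    json.dumps([1.0, math.inf], allow_nan=False)
except ValueError:
    print("Inf is not representable in strict JSON")
```

Filtering the value out (as the patch does) keeps checkpointing alive, though it doesn't explain why the loss blew up in the first place.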