Summary of the changes in PR #7:
A few bug fixes and tweaks for a stronger baseline.
This improves MRR from 0.5845 to 0.6155 and NDCG from 0.5070 to 0.5315 on val.
Changes:
- Switched off dropout during evaluation on the validation set in train.py (see the training-loop sketch after this list).
- Enabled shuffling of training batches (pass shuffle=True to the DataLoader).
- Explicitly cleared the GPU memory cache with torch.cuda.empty_cache(). The time cost on a single GPU is negligible, and it lets batch sizes of up to 32 x the number of GPUs fit in memory; there is some time gain when training with these larger batch sizes.
- Added a linear learning-rate warm-up (https://arxiv.org/abs/1706.02677), followed by multi-step decay (scheduler sketch below).
- Switched the decoder to a multi-layer LSTM with dropout (sketch below).
- Switched from dot-product attention to a richer element-wise multiplication + fc-layer attention (sketch below). (The network can still learn dot-product attention if it needs to.)
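
For reference, a minimal training-loop sketch covering the first three items. The `model`, dataset, `optimizer`, and `loss_fn` names are placeholders, not the actual objects in train.py:

```python
import torch
from torch.utils.data import DataLoader

def run_training(model, train_dataset, val_dataset, optimizer, loss_fn,
                 epochs=10, batch_size=32, device="cuda"):
    model.to(device)
    # Shuffle training batches each epoch; keep validation order fixed.
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

    for epoch in range(epochs):
        model.train()  # dropout active during training
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()

        # Release cached blocks so larger per-GPU batches fit;
        # negligible overhead on a single GPU.
        torch.cuda.empty_cache()

        model.eval()  # dropout switched off for evaluation
        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                outputs = model(inputs)  # compute val metrics (MRR / NDCG) here
```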
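
A sketch of the warm-up + multi-step-decay schedule, implemented here with LambdaLR; the step counts, milestones, and gamma are illustrative and not the values used in this PR:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def warmup_multistep_lambda(warmup_steps, milestones, gamma=0.1):
    """Return an LR multiplier: linear warm-up, then multi-step decay."""
    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps  # linear warm-up to the base LR
        factor = 1.0
        for m in milestones:
            if step >= m:
                factor *= gamma               # decay at each milestone
        return factor
    return lr_lambda

# Hypothetical usage: warm up for 500 steps, then decay at steps 5000 and 8000.
model = torch.nn.Linear(128, 128)             # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = LambdaLR(optimizer, lr_lambda=warmup_multistep_lambda(500, [5000, 8000]))

for step in range(10000):
    optimizer.step()   # after computing and backpropagating the loss
    scheduler.step()   # update the learning rate once per step
```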
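
A sketch of what the multi-layer LSTM decoder with dropout could look like; the class name, layer sizes, and dropout rate are assumptions, not the exact module in this repo:

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Illustrative decoder: stacked LSTM with inter-layer and output dropout."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512,
                 num_layers=2, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # nn.LSTM applies `dropout` between stacked layers (requires num_layers > 1).
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        x = self.embed(tokens)              # (batch, seq, embed_dim)
        x, hidden = self.lstm(x, hidden)    # (batch, seq, hidden_dim)
        logits = self.out(self.dropout(x))  # (batch, seq, vocab_size)
        return logits, hidden
```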
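
And a sketch of the element-wise multiplication + fc attention (module name and tensor shapes are assumptions). With all fc weights equal to 1 and zero bias, the score reduces to a plain dot product, which is why the network can still recover dot-product attention:

```python
import torch
import torch.nn as nn

class ElementwiseAttention(nn.Module):
    """score = fc(query * key) instead of a plain dot product."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, query, keys, values):
        # query: (batch, hidden_dim); keys, values: (batch, seq, hidden_dim)
        scores = self.fc(query.unsqueeze(1) * keys).squeeze(-1)       # (batch, seq)
        weights = torch.softmax(scores, dim=-1)                       # attention weights
        context = torch.bmm(weights.unsqueeze(1), values).squeeze(1)  # (batch, hidden_dim)
        return context, weights
```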