Hi,
We found that the video-text joint loss in pretraining is calculated from the masked video and text. Why not use the original video and text, as in retrieval fine-tuning?
Hi @zhangliang-04, we use the masked sequences for consistency with the other losses. A design tailored to the retrieval task might benefit from a non-masked version; however, we have not tested it. It may improve performance further.
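For illustration only, here is a minimal sketch of the two options being discussed: computing a video-text contrastive loss on masked inputs versus on the original inputs. This is not the UniVL code. The helper names (`mean_pool`, `mask_tokens`, `contrastive_loss`, `joint_loss`), the mean pooling, the zero-vector masking, and the InfoNCE-style loss are all assumptions chosen to keep the example small.

```python
import math
import random


def mean_pool(tokens):
    """Average a list of token vectors into one sequence-level vector."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def mask_tokens(tokens, prob, rng):
    """Replace each token vector with zeros with probability `prob`.

    Zeroing is a hypothetical stand-in for a learned mask embedding.
    """
    return [[0.0] * len(t) if rng.random() < prob else list(t) for t in tokens]


def contrastive_loss(video_vecs, text_vecs):
    """Symmetric InfoNCE-style loss; matched pairs share a batch index."""
    n = len(video_vecs)
    sim = [[sum(a * b for a, b in zip(v, t)) for t in text_vecs]
           for v in video_vecs]

    def nll(rows):
        # Mean negative log-softmax of the diagonal (matched) entries.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    cols = [list(c) for c in zip(*sim)]
    return 0.5 * (nll(sim) + nll(cols))


def joint_loss(videos, texts, use_masked, rng, mask_prob=0.15):
    """Compute the loss on masked inputs (use_masked=True) or the originals."""
    if use_masked:
        videos = [mask_tokens(v, mask_prob, rng) for v in videos]
        texts = [mask_tokens(t, mask_prob, rng) for t in texts]
    return contrastive_loss([mean_pool(v) for v in videos],
                            [mean_pool(t) for t in texts])


# Toy batch: 3 video/text pairs, 5 tokens each, 4-dimensional features.
rng = random.Random(0)
videos = [[[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)] for _ in range(3)]
texts = [[[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)] for _ in range(3)]

loss_original = joint_loss(videos, texts, use_masked=False, rng=rng)
loss_masked = joint_loss(videos, texts, use_masked=True, rng=rng)
print("original:", loss_original, "masked:", loss_masked)
```

The only difference between the two calls is whether the inputs pass through `mask_tokens` before pooling, which is the choice the question is about. Whether the non-masked variant actually helps retrieval would need to be tested on the real model.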
Referenced code: UniVL/modules/modeling.py, line 258 in 0a7c07f