Hi, I'm also working on Zero-Shot Temporal Action Localization, and I found that if I use a Transformer to model the CLIP video-frame features, I get a high mAP on the training set but a low mAP on the test set. My guess is that after passing through the Transformer, the CLIP frame features become hard to match against the text features. What is the key to solving this problem?
Can you help me? 😭
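For context, here is a minimal sketch of the zero-shot matching setup described above: per-frame CLIP features are scored against a text embedding by cosine similarity, and high-scoring frames form the localized segment. All names, feature values, and the threshold are hypothetical placeholders, not the actual pipeline.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length, as CLIP embeddings are before matching."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

# Hypothetical per-frame CLIP features (T frames, d dims) and one text feature.
frame_feats = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
text_feat = [1.0, 0.0]

# Zero-shot localization: score each frame against the text embedding and
# keep the frames above a similarity threshold (0.5 is an arbitrary choice).
scores = [cosine(f, text_feat) for f in frame_feats]
segment = [i for i, s in enumerate(scores) if s > 0.5]
print(segment)  # indices of frames predicted to contain the action
```

The overfitting concern in the question arises when a trainable Transformer is inserted before this matching step: its outputs can drift away from the joint image-text embedding space that makes the cosine comparison meaningful.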