There is an increasing trend in the research community toward video processing using artificial intelligence. Trending tasks include:
- Video classification.
- Video content description.
- Video question answering (VQA).
The main idea of the project is to search for the segment of a video that is most relevant to a given query ("question").
Instead of watching the complete video to find the interval you are interested in, you give our model the video and a query describing the part you want; the model then returns the intervals sorted by relevance to the query.
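The sketch below illustrates this intended usage. The function, file name, and scores are hypothetical placeholders for illustration, not this repository's actual API.

```python
# Illustrative sketch of the query-to-interval interface (hypothetical names,
# not this repo's actual API; the scores below are dummy values).
from typing import List, Tuple

Interval = Tuple[float, float, float]  # (start_sec, end_sec, relevance_score)

def rank_intervals(video_path: str, query: str) -> List[Interval]:
    """Score candidate intervals of the video against the query and return
    them sorted by relevance (highest first). Here we return dummy scores;
    the real model predicts them from visual and text features."""
    candidates = [(0.0, 5.0, 0.12), (5.0, 10.0, 0.87), (10.0, 15.0, 0.43)]
    return sorted(candidates, key=lambda iv: iv[2], reverse=True)

# Usage: instead of watching the full video, ask for the relevant part.
for start, end, score in rank_intervals("video.mp4", "a man is cooking pasta"):
    print(f"{start:5.1f}s - {end:5.1f}s  relevance={score:.3f}")
```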
We use the Microsoft Research Video to Text (MSR-VTT) dataset.
An example from the dataset is shown below.
We extracted the visual features of the dataset using three different models (a minimal extraction sketch follows the list):
- ResNet-152 (as in the paper): gdrive link
- NASNet: gdrive link
- Inception-ResNet-v2: gdrive link
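As a rough illustration of how such per-frame features can be extracted, here is a minimal sketch using the pretrained backbones from `tf.keras.applications`. The frame sampling, pooling, and backbone configuration are assumptions and may differ from how the released features were produced.

```python
# Minimal sketch of per-frame visual feature extraction with a pretrained CNN.
# Backbone configuration, resizing, and pooling here are illustrative assumptions.
import numpy as np
import tensorflow as tf

def build_extractor(name: str = "inception_resnet_v2"):
    """Return (model, preprocess_fn, input_size) for a pretrained backbone."""
    if name == "resnet152":
        model = tf.keras.applications.ResNet152(weights="imagenet", include_top=False, pooling="avg")
        return model, tf.keras.applications.resnet.preprocess_input, 224
    if name == "nasnet":
        model = tf.keras.applications.NASNetLarge(weights="imagenet", include_top=False, pooling="avg")
        return model, tf.keras.applications.nasnet.preprocess_input, 331
    model = tf.keras.applications.InceptionResNetV2(weights="imagenet", include_top=False, pooling="avg")
    return model, tf.keras.applications.inception_resnet_v2.preprocess_input, 299

def extract_features(frames: np.ndarray, name: str = "inception_resnet_v2") -> np.ndarray:
    """frames: (num_frames, H, W, 3) RGB array -> (num_frames, feat_dim) features."""
    model, preprocess, size = build_extractor(name)
    resized = tf.image.resize(frames, (size, size))   # backbone-specific input size
    return model.predict(preprocess(resized), verbose=0)
```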
The base architecture used in the paper is shown here.
We trained the model with different visual feature extractors and with small changes to the model architecture, listed below (a sketch of the Squeeze-and-Excitation gating follows the list):
- Using the ResNet visual feature extractor (as in the paper): gdrive link
- Using the NASNet visual feature extractor: gdrive link
- Using the Inception-ResNet-v2 visual feature extractor: gdrive link
- Using the Squeeze-and-Excitation technique with Inception-ResNet-v2: gdrive link
- Using the Dropout technique: gdrive link
- Using Squeeze-and-Excitation together with Dropout: gdrive link
- Using the Squeeze-and-Excitation technique and increasing the hidden dimension of the LSTMs: gdrive link
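For context on the Squeeze-and-Excitation experiments above, here is a minimal sketch of channel-wise gating applied to frame features. The placement of the block, the reduction ratio, and the feature dimensions are illustrative assumptions rather than the exact configuration we trained.

```python
# Hedged sketch of a Squeeze-and-Excitation block gating visual frame features
# before the LSTM. Reduction ratio and dimensions are illustrative assumptions.
import tensorflow as tf

def se_block(features: tf.Tensor, reduction: int = 16) -> tf.Tensor:
    """features: (batch, time, channels) frame features.
    Squeeze: average over time to get one descriptor per channel.
    Excite:  two dense layers produce a sigmoid gate per channel.
    Scale:   reweight the original features channel-wise."""
    channels = features.shape[-1]
    squeezed = tf.reduce_mean(features, axis=1)                        # (batch, channels)
    gate = tf.keras.layers.Dense(channels // reduction, activation="relu")(squeezed)
    gate = tf.keras.layers.Dense(channels, activation="sigmoid")(gate)
    return features * tf.expand_dims(gate, axis=1)                     # broadcast over time

# Example: gate Inception-ResNet-v2 features (1536-dim) for a clip of 20 frames.
x = tf.random.normal([2, 20, 1536])
print(se_block(x).shape)  # (2, 20, 1536)
```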
From the results of these experiments, we found that the best performance is obtained when using Inception-ResNet-v2 as the visual feature extractor.
Our model outperforms the original paper's model on all reported metrics, as shown in the following table:
These results were obtained on the test set, which contains 2,990 videos.
You can see the comparison between all models in the following figure:
Contributions are always welcome!
Please read the contribution guidelines first.
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.