
Very impressive work! (Feel free to discuss about the paper here!) #1

Closed
Sissuire opened this issue Nov 14, 2022 · 7 comments

@Sissuire

Perceptual quality is always entangled with aesthetic and technical effects, especially for UGC videos. The idea is quite clear and the performance is good!

After reading the work, a few questions came to mind, and I'd like to discuss them with everyone interested in this topic.

  1. Throughout the work, it seems that the disentanglement is responsible for the large improvement. My confusion is why disentanglement should improve performance. Should we simply believe that entangled representations mixing aesthetic and technical features restrict the task, and that disentangled ones work better?

  2. I've seen different network structures adopted in the work (e.g., inflated ConvNeXt, Swin Transformer), whereas the most popular backbone in VQA is probably ResNet-50. So, how do the different networks affect performance? Has anyone conducted a detailed experiment on different network structures?

@teowu
Member

teowu commented Nov 14, 2022

Hi Yongxu, thanks for the thoughtful questions! These are good points to discuss~

  1. For the first question, the disentanglement acts somewhat like masked representation learning: the two decomposed views can be regarded as appropriate masks for this task. Similar strategies have been widely attempted in high-level tasks, and our design shows they are also successful in UGC-VQA (see the sketch at the end of this comment).

  2. For the second question, yes, we will try more backbones! This was also suggested by my co-authors, and we plan to release results for different backbones in this repo. Stay tuned!

For further discussion, you can contact me on WeChat: haoningnanyangtu (also open to everyone interested in this topic).
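
To make the "views as masks" intuition concrete, below is a rough PyTorch sketch of what the two decomposed views could look like, assuming DOVER-style views (a spatially downsampled aesthetic view and a fragment-sampled technical view at native resolution). Function names, grid/patch sizes, and shapes are illustrative assumptions, not the repository's actual API.

```python
# Rough illustration only: view decomposition as two complementary "masks" over
# one clip. Names and sizes here are assumptions, not the repo's API.
import torch
import torch.nn.functional as F

def aesthetic_view(clip: torch.Tensor, size: int = 224) -> torch.Tensor:
    # clip: (T, C, H, W). Spatial downsampling keeps global composition and
    # semantics but masks out most local technical distortions.
    return F.interpolate(clip, size=(size, size), mode="bilinear", align_corners=False)

def technical_view(clip: torch.Tensor, grid: int = 7, patch: int = 32) -> torch.Tensor:
    # Sample a grid of small native-resolution patches ("fragments"): keeps
    # local distortions but masks out / scrambles the global composition.
    T, C, H, W = clip.shape
    rows = torch.linspace(0, H - patch, grid).long()
    cols = torch.linspace(0, W - patch, grid).long()
    out = clip.new_zeros(T, C, grid * patch, grid * patch)
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            out[:, :, i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = \
                clip[:, :, int(r):int(r) + patch, int(c):int(c) + patch]
    return out

# Example: two views of the same 8-frame clip, each hiding what the other keeps.
clip = torch.rand(8, 3, 540, 960)
a_view, t_view = aesthetic_view(clip), technical_view(clip)
```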

@Sissuire
Author

@teowu Thanks for your kind reply. This is just an open discussion: as far as you know, is there any literature claiming that disentangled representations perform better than entangled ones? (thus we must do representation disentanglement)

@teowu teowu pinned this issue Nov 14, 2022
@teowu
Member

teowu commented Nov 14, 2022

“thus we must do representation disentanglement” Not a must lah.
I think our main goal is to learn aesthetic and technical quality opinions from the overall one, and for improving performance this is just one way among many: adding heavier branches without View Decomposition might also work, but that goes against our wish for more explainable representation learning.

As for whether disentanglement enhances representations, I think this is a common idea in higher-level tasks (our related works also cite some). Most recently, I read a paper at this year's NeurIPS sharing similar ideas, but I cannot find it now... Should I find it, I will post its link here.

@teowu
Member

teowu commented Nov 14, 2022

BTW, I like this discussion, so I pinned it here (as if I were on OpenReview for ICLR or NeurIPS lol)

@teowu teowu changed the title Very impressive work! Very impressive work! (Feel free to discuss about the paper here!) Nov 14, 2022
@Sissuire
Author

@teowu Great thanks :)

@teowu teowu added the good first issue Good for newcomers label Nov 14, 2022
@allexeyj

allexeyj commented Jul 22, 2023

@Sissuire @teowu
I am glad there is an opportunity to discuss this wonderful article. There are some things I could not find in the article; please help.

I have a question regarding fine-tuning DOVER on VQA datasets that only have an overall video quality score. Let's take KoNViD-1k as an example. If you look at the metadata, you can see there is only the overall video quality (hereinafter Qo); there is no technical (hereinafter Qt) or aesthetic (hereinafter Qa) video quality. But if you look at the labels for this dataset, there are three values per video, which seem to be Qa, Qt, and Qo. How did the authors get Qa and Qt if the original dataset only contains Qo? How was the data in labels.txt obtained?

I also found that some labels.txt files have the following structure: -1, -1, MOS (Qo). If I have no Qa and Qt, but do have Qo, can I just set Qa and Qt to -1 in labels.txt?

@teowu
Member

teowu commented Jul 27, 2023

Hi Alex, Q_o is all there is. The DIVIDE-3k database (the only database with Q_a and Q_t, as we proposed) will be released soon. In https://github.com/VQAssessment/DOVER/blob/master/examplar_data_labels/KoNViD/labels.txt, the second and third values are video length and framerate; these are deprecated in other datasets and therefore left as -1 placeholders.
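
For reference, here is a minimal sketch of reading such a labels.txt, assuming comma-separated lines of the form name, length, framerate, MOS, with -1 wherever length/framerate are unavailable; the exact field order and the helper name read_labels are assumptions for illustration.

```python
# Minimal sketch for reading a KoNViD-style labels.txt as described above.
# Assumes comma-separated "name, length, framerate, MOS" lines, with -1 as a
# placeholder for unavailable length/framerate; field order is an assumption.
def read_labels(path: str):
    entries = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            name, length, fps, mos = [x.strip() for x in line.split(",")]
            entries.append({
                "name": name,
                "length": None if length == "-1" else float(length),
                "framerate": None if fps == "-1" else float(fps),
                "mos": float(mos),  # the overall score Q_o is the only label here
            })
    return entries
```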

@teowu teowu closed this as completed Jul 27, 2023