
I can't reproduce many original performances for Qwen_VL. Who can help me? #374

xiaochengsky opened this issue Oct 28, 2024 · 2 comments

xiaochengsky commented Oct 28, 2024

  • A descriptive title: How to reproduce performance in Scienceqa_img for Qwen_VL?
  • Can't reproduce performance with the default config.

Hi, thank you for this great project!
When I used the default configuration to reproduce the ScienceQA-IMG performance for Qwen_VL, I got much lower accuracy, only 46.7...

Here is my command:

```bash
python lmms_eval/__main__.py --model qwen_vl --model_args pretrained="Qwen/Qwen-VL" --tasks scienceqa_img --batch_size 1 --log_samples --log_samples_suffix qwenvl_scienceqa_img --output_path ./logs/
```

and then I got this result:

| Tasks         | Version | Filter | n-shot | Metric      | Value | Stderr   |
|---------------|---------|--------|--------|-------------|-------|----------|
| scienceqa_img | Yaml    | none   | 0      | exact_match | 0.467 | ± 0.0111 |

Maybe the prompt does not match? I also printed some outputs and found redundant information (an "Explanation: ..." suffix), like this (a possible workaround is sketched after the log):

```
Model Responding: 0%|▎ | 4/2017 [00:01<12:17, 2.73it/s]text_outputs: B
Explanation: The passage says that Ernest put a parachute with a 1
Model Responding: 0%|▎ | 5/2017 [00:02<14:32, 2.31it/s]text_outputs: B
Explanation: The passage says that Gordon put a parachute with a 1
Model Responding: 0%|▍ | 6/2017 [00:02<15:55, 2.11it/s]text_outputs: B
Model Responding: 0%|▌ | 7/2017 [00:03<12:50, 2.61it/s]text_outputs: A
Model Responding: 0%|▌ | 8/2017 [00:03<10:50, 3.09it/s]text_outputs: B
Explanation: The passage says that Sebastian put a parachute with a 1
Model Responding: 0%|▋ | 9/2017 [00:03<13:14, 2.53it/s]text_outputs: A
Model Responding: 0%|▋ | 10/2017 [00:04<11:09, 3.00it/s]text_outputs: C
```
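
For what it's worth, here is a minimal post-processing sketch (my own hypothetical workaround, not part of lmms-eval) that keeps only the option letter and drops the trailing `Explanation: ...` text before exact-match scoring:

```python
import re

def extract_choice(text_output: str) -> str:
    """Keep only the leading option letter, dropping any trailing
    'Explanation: ...' text that would break exact_match scoring."""
    # Cut everything from the first "Explanation:" onward.
    answer = re.split(r"\bExplanation:", text_output, maxsplit=1)[0].strip()
    # If the model echoed extra words, keep just the first A-E letter.
    match = re.match(r"[A-E]\b", answer)
    return match.group(0) if match else answer

print(extract_choice("B\nExplanation: The passage says that ..."))  # -> B
```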

So, has anyone had these problems?

Thanks!

xiaochengsky (Author) commented:

OCRBench is also low:

| Tasks    | Version | Filter | n-shot | Metric            | Value | Stderr |
|----------|---------|--------|--------|-------------------|-------|--------|
| ocrbench | Yaml    | none   | 0      | ocrbench_accuracy | 0.133 | ± N/A  |

But AI2D looks normal:

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| ai2d  | Yaml    | flexible-extract | 0      | exact_match | 0.5858 | ± 0.0089 |

@xiaochengsky xiaochengsky changed the title How to produce performance in Scienceqa_img for Qwen_VL? How to reproduce original performance in Scienceqa_img for Qwen_VL? Oct 29, 2024
@xiaochengsky xiaochengsky changed the title How to reproduce original performance in Scienceqa_img for Qwen_VL? I can't reproduce original performances in some datasets for Qwen_VL. Oct 29, 2024
@xiaochengsky xiaochengsky changed the title I can't reproduce original performances in some datasets for Qwen_VL. I can't reproduce many original performances for Qwen_VL. Who can help me? Oct 29, 2024

kcz358 (Collaborator) commented Oct 29, 2024

By default, we use a chat template to format the prompt for Qwen_VL. During earlier development, we tried to align the AI2D result for Qwen_VL by changing some prompts. However, that is not possible for every dataset, and we do not feel it is the correct or fair way to evaluate, which is also the motivation for this project. If you want to match their results exactly, you may need to change how we format the context in Qwen_VL to match their eval scripts here.
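
For anyone comparing the two, here is a purely illustrative sketch of the difference (the exact strings in lmms-eval and in Qwen-VL's eval scripts differ; the prompts below are assumptions):

```python
# Illustrative only: neither string is copied from lmms-eval or Qwen-VL.
image_path = "demo.jpg"
question = "Which option is correct?\n(A) ...\n(B) ..."

# Roughly what chat-template formatting produces (ChatML-style wrapper):
chat_style = (
    "<|im_start|>user\n"
    f"<img>{image_path}</img>{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Roughly what a bare continuation prompt in an eval script looks like:
# no chat wrapper, just the image tag, the question, and an answer cue.
raw_style = f"<img>{image_path}</img>{question} Answer:"
```

The chat-wrapped prompt invites a conversational answer (hence the extra "Explanation: ..." text above), while a bare continuation prompt nudges the model to emit just the option letter.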
