
I can't reproduce many original performances for Qwen_VL. Who can help me? #374

xiaochengsky opened this issue Oct 28, 2024 · 2 comments

xiaochengsky commented Oct 28, 2024

  • A descriptive title: How to reproduce performance in Scienceqa_img for Qwen_VL?
  • Can't reproduce performance with the default config.

Hi, thank you for this great project!
When I used the default configuration to reproduce the ScienceQA-IMG performance for Qwen_VL, I got much lower accuracy, only 46.7...

Here is my command:

```bash
python lmms_eval/__main__.py --model qwen_vl --model_args pretrained="Qwen/Qwen-VL" --tasks scienceqa_img --batch_size 1 --log_samples --log_samples_suffix qwenvl_scienceqa_img --output_path ./logs/
```

and then I got this result:

| Tasks         | Version | Filter | n-shot | Metric      | Value | Stderr   |
|---------------|---------|--------|--------|-------------|-------|----------|
| scienceqa_img | Yaml    | none   | 0      | exact_match | 0.467 | ± 0.0111 |

Maybe the prompt does not match? I also printed some outputs and found redundant information (an "Explanation: ..." suffix), like this (a possible workaround is sketched after the log):

```
Model Responding: 0%|▎ | 4/2017 [00:01<12:17, 2.73it/s]text_outputs: B
Explanation: The passage says that Ernest put a parachute with a 1
Model Responding: 0%|▎ | 5/2017 [00:02<14:32, 2.31it/s]text_outputs: B
Explanation: The passage says that Gordon put a parachute with a 1
Model Responding: 0%|▍ | 6/2017 [00:02<15:55, 2.11it/s]text_outputs: B
Model Responding: 0%|▌ | 7/2017 [00:03<12:50, 2.61it/s]text_outputs: A
Model Responding: 0%|▌ | 8/2017 [00:03<10:50, 3.09it/s]text_outputs: B
Explanation: The passage says that Sebastian put a parachute with a 1
Model Responding: 0%|▋ | 9/2017 [00:03<13:14, 2.53it/s]text_outputs: A
Model Responding: 0%|▋ | 10/2017 [00:04<11:09, 3.00it/s]text_outputs: C
```
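
For what it's worth, here is a minimal post-processing sketch (my own hypothetical workaround, not part of lmms-eval) that keeps only the option letter and drops the trailing `Explanation: ...` text before exact-match scoring:

```python
import re

def extract_choice(text_output: str) -> str:
    """Keep only the leading option letter, dropping any trailing
    'Explanation: ...' text that would break exact_match scoring."""
    # Cut everything from the first "Explanation:" onward.
    answer = re.split(r"\bExplanation:", text_output, maxsplit=1)[0].strip()
    # If the model echoed extra words, keep just the first A-E letter.
    match = re.match(r"[A-E]\b", answer)
    return match.group(0) if match else answer

print(extract_choice("B\nExplanation: The passage says that ..."))  # -> B
```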

So, has anyone had these problems?

Thanks!

xiaochengsky (Author) commented:

OCRBench is also low:

| Tasks    | Version | Filter | n-shot | Metric            | Value | Stderr |
|----------|---------|--------|--------|-------------------|-------|--------|
| ocrbench | Yaml    | none   | 0      | ocrbench_accuracy | 0.133 | ± N/A  |

But AI2D looks normal:

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| ai2d  | Yaml    | flexible-extract | 0      | exact_match | 0.5858 | ± 0.0089 |

@xiaochengsky xiaochengsky changed the title How to produce performance in Scienceqa_img for Qwen_VL? How to reproduce original performance in Scienceqa_img for Qwen_VL? Oct 29, 2024
@xiaochengsky xiaochengsky changed the title How to reproduce original performance in Scienceqa_img for Qwen_VL? I can't reproduce original performances in some datasets for Qwen_VL. Oct 29, 2024
@xiaochengsky xiaochengsky changed the title I can't reproduce original performances in some datasets for Qwen_VL. I can't reproduce many original performances for Qwen_VL. Who can help me? Oct 29, 2024

kcz358 (Collaborator) commented Oct 29, 2024

By default, we use a chat template to format the prompt for Qwen_VL. During earlier development, we tried to align the AI2D result for Qwen_VL by changing some prompts. However, that is not possible for every dataset, and we do not feel it is the correct or fair way to evaluate, which is also the motivation for this project. If you want to match their results exactly, you may need to change how we format the context in Qwen_VL to match their eval scripts here.
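
For anyone comparing the two, here is a purely illustrative sketch of the difference (the exact strings in lmms-eval and in Qwen-VL's eval scripts differ; the prompts below are assumptions):

```python
# Illustrative only: neither string is copied from lmms-eval or Qwen-VL.
image_path = "demo.jpg"
question = "Which option is correct?\n(A) ...\n(B) ..."

# Roughly what chat-template formatting produces (ChatML-style wrapper):
chat_style = (
    "<|im_start|>user\n"
    f"<img>{image_path}</img>{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Roughly what a bare continuation prompt in an eval script looks like:
# no chat wrapper, just the image tag, the question, and an answer cue.
raw_style = f"<img>{image_path}</img>{question} Answer:"
```

The chat-wrapped prompt invites a conversational answer (hence the extra "Explanation: ..." text above), while a bare continuation prompt nudges the model to emit just the option letter.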
