Reproducibility issues with some few-shot SuperGLUE datasets #15
I happen to have the same question. In my SuperGLUE experiments I found that the scores on several datasets are highly sensitive to the random seed (and even more so to the codebase used: running with Jiant versus AllenNLP also gives differences of several points). On CB the score can even swing from the 70s to the 90s. How did the authors handle these sources of randomness?
On CB, just running BERT-BASE-UNCASED with ten different random seeds already shows differences of this magnitude (results from Jiant).
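For reference, a minimal sketch (with placeholder scores, not actual results from any run) of how such per-seed variance can be summarized:

```python
# Illustrative only: aggregate accuracy across several random seeds to quantify
# the variance discussed above. The per-seed CB scores below are placeholders.
import statistics

cb_scores = [0.732, 0.768, 0.821, 0.857, 0.875, 0.893, 0.768, 0.804, 0.839, 0.911]

mean = statistics.mean(cb_scores)
std = statistics.stdev(cb_scores)
print(f"CB accuracy over {len(cb_scores)} seeds: {mean:.3f} ± {std:.3f} "
      f"(min {min(cb_scores):.3f}, max {max(cb_scores):.3f})")
```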
When you ran it, did you set the emb size to 768? Did you modify any other code? I am running RTE and the metric is very low, only around 30-40, and I don't know why.
It is indeed puzzling. I tried running the CB script and got an error: the prompt embedding size defaults to 128, so it does not match when substituted into the BERT embedding, yet the CB script does not specify a value for this embedding argument?
Thanks for your efforts in reproducing P-tuning on few-shot SuperGLUE. In practice, we find that few-shot reproducibility depends heavily on the environment, the hyper-parameters (e.g., batch size, gradient accumulation steps), and the number of parallel GPUs. For example, in our experiments we train each dataset on 8 V100 GPUs; with fewer GPUs or a different GPU type, performance can vary greatly. In light of this volatility, the follow-up work FewNLU by @zheng-yanan presents a more robust evaluation framework for few-shot SuperGLUE, and P-tuning is re-implemented in the FewNLU framework. Please check it out if you have trouble setting up an identical environment for a fair comparison.
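For illustration, a minimal sketch (the numbers are assumptions, not the paper's actual settings) of how the effective batch size follows from these factors, and how gradient accumulation can compensate for fewer GPUs:

```python
# The effective (global) batch size is the product of the three factors below,
# so running on fewer GPUs with unchanged per-device settings changes the
# optimization trajectory. Example values only.
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    return per_device_batch * grad_accum_steps * num_gpus

multi_gpu  = effective_batch_size(per_device_batch=2, grad_accum_steps=1, num_gpus=8)  # 16
single_gpu = effective_batch_size(per_device_batch=2, grad_accum_steps=8, num_gpus=1)  # 16

# Raising grad_accum_steps on a single GPU can match the global batch size,
# though GPU type and other environment factors may still cause differences.
print(multi_gpu, single_gpu)
```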
Does the prompt embedding size need to match the pretrained model's embedding_dim? Running the authors' code as-is throws a dimension-mismatch error.
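For reference, a minimal sketch (not the repository's code; the bert-base-uncased backbone and the prompt length are assumptions) of checking that the prompt embedding dimension matches the backbone's hidden size, which is exactly the 128-vs-768 mismatch reported above:

```python
# Continuous prompt vectors are inserted into the backbone's input embeddings,
# so their dimension must equal the backbone's hidden size.
import torch
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bert-base-uncased")
hidden_size = config.hidden_size  # 768 for bert-base-uncased

prompt_length = 3  # hypothetical number of prompt tokens
prompt_embedding = torch.nn.Embedding(prompt_length, hidden_size)
print(prompt_embedding.weight.shape)  # torch.Size([3, 768]) -- matches the backbone
```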
Hello, when reproducing the few-shot SuperGLUE experiments (i.e., the FewGLUE_32dev data), my results on the CB, WSC, and COPA datasets show a noticeable gap from those in the paper (all reproduction runs use the albert-xxlarge-v2 pretrained model, consistent with the paper's design, with seed=42 and no modifications). Differences in experimental setup:
Experiments on the CB dataset
Experiments on the WSC dataset
Experiments on the COPA dataset
Differences in Python library versions
Since version differences might affect reproduction, here are the Python library versions corresponding to requirements.txt (the versions from the project's requirements are given in parentheses):
Because my device's CUDA version is constrained, the torch-related library versions differ from those in the code; other libraries such as tqdm and tensorboardX should not affect the results.
Could these library version differences explain the gap in results? (A sketch for recording installed versions follows below.)
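For reference, a minimal sketch (the package list is illustrative) for recording the exact installed versions so they can be compared against requirements.txt:

```python
# Print the installed versions of a few key packages for a reproduction report.
import importlib.metadata as metadata

for pkg in ["torch", "transformers", "tqdm", "tensorboardX"]:
    try:
        print(f"{pkg}=={metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        print(f"{pkg}: not installed")
```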
Hardware differences
All reproduction experiments were run on a single GeForce RTX 3090. How should I interpret the differences in model performance?