Hi,
Thanks for sharing the code and models for your great work!
I tried to reproduce the BLEU scores using your code but could not get the 5% BLEU score you reported in the paper.
Based on your code and on experiments we ran internally, it seems that you calculate the BLEU score on the segmented text (i.e. after Farasa segmentation). This roughly inflates the BLEU score by a factor of about 2.5. We found that, even in the best case (after extensive hyperparameter tuning, more than 40 experiments), the best BLEU score is around 1.6-1.9.
The bug is in the compute_metrics method:
def compute_metrics(pred):
    labels_ids = pred.label_ids
    # pred_ids = torch.argmax(pred.predictions, dim=2)
    pred_ids = pred.predictions
    # all unnecessary tokens are removed
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    #########################################################
    ### You should add these lines to calculate the BLEU score correctly ###
    #########################################################
    def clean_and_desegment(text):
        # strip stray quotes and any special tokens that survive decoding
        text = text.replace("'", "")
        text = text.replace("[CLS]", "")
        text = text.replace("[SEP]", "")
        # this is the most important step: undo the Farasa segmentation
        # so BLEU is computed on whole words, not on sub-word segments
        return str(arabert_prep.desegment(text))

    pred_str = [clean_and_desegment(p) for p in pred_str]
    label_str = [clean_and_desegment(l) for l in label_str]
    #########################################################
    return {"bleu": round(corpus_bleu(pred_str, [label_str]).score, 4)}
Here is an example of how BLEU scores differ between the same text in segmented vs. desegmented form:
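(The snippet assumes roughly this setup; the exact model name is an assumption, and any AraBERT variant that enables Farasa segmentation behaves the same way.)

import sacrebleu
from arabert.preprocess import ArabertPreprocessor

# Assumed instantiation: a Farasa-segmenting AraBERT preprocessor.
# preprocess() applies Farasa segmentation, desegment() reverses it.
arabert_prep = ArabertPreprocessor(model_name="aubmindlab/bert-base-arabertv2")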
t1 = "أنا متأكد من أنها ستكون بخير."
ppt1 = arabert_prep.preprocess(t1)
print(ppt1) # أنا متأكد من أن +ها س+ تكون ب+ خير .
t2 = "أنا متأكد ستكون بخير."
ppt2 = arabert_prep.preprocess(t2)
print(ppt2) # أنا متأكد س+ تكون ب+ خير .
print("Wrong BLUE on segmenented text", sacrebleu.sentence_bleu(ppt2, [ppt1]).score) # 51.51425457345961
print("Correct BLUE on unsegmenented text", sacrebleu.sentence_bleu(t2, [t1]).score) # 33.51600230178196
Thanks,