Hi,
Thanks for sharing the code and models for your great work!
I tried to reproduce the BLEU scores using your code but could not get the 5% BLEU score you reported in the paper.
Based on your code and on experiments we ran internally, it seems that you calculate the BLEU score on the segmented text (i.e. after Farasa segmentation). This roughly inflates the BLEU score by a factor of about 2.5. We found that, even in the best case (after extensive hyperparameter tuning, more than 40 experiments), the best BLEU score is around 1.6-1.9.
The bug is in the compute_metrics method:
def compute_metrics(pred):
    labels_ids = pred.label_ids
    # pred_ids = torch.argmax(pred.predictions, dim=2)
    pred_ids = pred.predictions
    # all unnecessary tokens are removed
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    #########################################################
    ### You should add these lines to calculate the BLEU score correctly ###
    #########################################################
    def clean_and_desegment(text):
        # strip stray quotes and any special tokens that survive decoding
        text = text.replace("'", "")
        text = text.replace("[CLS]", "")
        text = text.replace("[SEP]", "")
        # this is the most important step: undo the Farasa segmentation
        # so BLEU is computed on whole words, not on sub-word segments
        return str(arabert_prep.desegment(text))

    pred_str = [clean_and_desegment(p) for p in pred_str]
    label_str = [clean_and_desegment(l) for l in label_str]
    #########################################################
    return {"bleu": round(corpus_bleu(pred_str, [label_str]).score, 4)}
Here is an example of how BLEU scores differ between the same text in segmented vs. desegmented form:
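(The snippet assumes roughly this setup; the exact model name is an assumption, and any AraBERT variant that enables Farasa segmentation behaves the same way.)

import sacrebleu
from arabert.preprocess import ArabertPreprocessor

# Assumed instantiation: a Farasa-segmenting AraBERT preprocessor.
# preprocess() applies Farasa segmentation, desegment() reverses it.
arabert_prep = ArabertPreprocessor(model_name="aubmindlab/bert-base-arabertv2")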
t1 = "أنا متأكد من أنها ستكون بخير."
ppt1 = arabert_prep.preprocess(t1)
print(ppt1) # أنا متأكد من أن +ها س+ تكون ب+ خير .
t2 = "أنا متأكد ستكون بخير."
ppt2 = arabert_prep.preprocess(t2)
print(ppt2) # أنا متأكد س+ تكون ب+ خير .
print("Wrong BLUE on segmenented text", sacrebleu.sentence_bleu(ppt2, [ppt1]).score) # 51.51425457345961
print("Correct BLUE on unsegmenented text", sacrebleu.sentence_bleu(t2, [t1]).score) # 33.51600230178196
Thanks,