
Error in BLEU calculation #4

Open
ghaddarAbs opened this issue Mar 31, 2022 · 0 comments

ghaddarAbs commented Mar 31, 2022

Hi,

Thanks for sharing the code and models for your great work!!!

I tried to reproduce the BLEU scores using your code but could not get the BLEU score of ~5 reported in the paper.
Based on your code and on experiments we ran internally, it seems that you compute the BLEU score on the segmented text (i.e. after Farasa segmentation). This yields a BLEU score roughly 2.5× higher than it should be. In the best case, after extensive hyperparameter tuning (>40 experiments), the best BLEU score we obtained is around 1.6–1.9.

The bug is in the compute_metrics method:

from sacrebleu import corpus_bleu  # tokenizer and arabert_prep are assumed to be defined earlier

def compute_metrics(pred):
  labels_ids = pred.label_ids
  # pred_ids = torch.argmax(pred.predictions, dim=2)
  pred_ids = pred.predictions

  # all unnecessary tokens are removed
  pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
  labels_ids[labels_ids == -100] = tokenizer.pad_token_id
  label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

  #########################################################
  ### You should add these lines to compute the BLEU score correctly
  ### (batch_decode returns lists, so the cleanup is applied to each string)
  #########################################################
  def clean(text):
      text = text.replace("'", "")
      text = text.replace("[CLS]", "")
      text = text.replace("[SEP]", "")
      return str(arabert_prep.desegment(text))  # this is the most important step

  pred_str = [clean(p) for p in pred_str]
  label_str = [clean(l) for l in label_str]
  #########################################################

  return {"bleu": round(corpus_bleu(pred_str, [label_str]).score, 4)}

Here is an example of how the BLEU score differs between the segmented and unsegmented versions of the same text:

import sacrebleu  # arabert_prep is the same ArabertPreprocessor instance as above

t1 = "أنا متأكد من أنها ستكون بخير."
ppt1 = arabert_prep.preprocess(t1)
print(ppt1) # أنا متأكد من أن +ها س+ تكون ب+ خير .

t2 = "أنا متأكد ستكون بخير."
ppt2 = arabert_prep.preprocess(t2)
print(ppt2) # أنا متأكد س+ تكون ب+ خير .


print("Wrong BLEU on segmented text", sacrebleu.sentence_bleu(ppt2, [ppt1]).score)    # 51.51425457345961
print("Correct BLEU on unsegmented text", sacrebleu.sentence_bleu(t2, [t1]).score)    # 33.51600230178196

Thanks,
