
Reproduce experimental results. #7

Open
Taeyoung-Jang opened this issue Apr 21, 2024 · 1 comment


Taeyoung-Jang commented Apr 21, 2024

Thank you for your great work!

I just ran the evaluation pipeline and checked the pass rates for ToolLLaMA v2, gpt-3.5-turbo, and gpt-4-turbo. However, all the pass rates are significantly lower than the scores reported in the paper.

I have confirmed that gpt-4-turbo is being used both on the server and during evaluation. Are there any considerations I should take into account during the inference process to obtain comparable results?

I am also curious whether any specific hyperparameters are needed to reproduce results close to those in the paper. (I would expect an error margin of up to 5% when reproducing the experiments.)

zhichengg (Collaborator) commented Apr 26, 2024

Hi, thank you for your interest in this work.

We are currently experiencing two issues that may cause the reproducibility problem:

  • Firstly, the real API server maintained by the ToolBench team has been unstable. As reported by other users, many calls to real APIs returned 500 errors. We are investigating this and hope to fix it soon. You can double-check your replicated trajectories to see whether you are affected (see the sketch after this list).
  • Secondly, OpenAI updated their gpt-4-turbo models this month. With the new model as the evaluator, the measured performance drops systematically. We used gpt-4-turbo-preview in our experiments, but the behaviour of that model has also changed considerably. We will soon update the reported model performance using gpt-4-turbo-2024-04-09 and publish our model inference results. We are also training our own evaluator based on an open-source model to replace these closed-source models.
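
For the first point, a rough way to check whether your replicated trajectories were hit by the server instability is to scan the dumped trajectory files for 500-style errors. The sketch below is only an illustration under assumptions: the directory path, the file layout, and the exact error strings are not taken from the official pipeline, so adapt them to how your runs store trajectories.

```python
# Hypothetical helper: scan replicated trajectory dumps for signs of real-API 500 errors.
# The directory path and the error strings below are assumptions -- adjust them to
# match how your evaluation pipeline saves its trajectories.
from pathlib import Path

def count_server_errors(traj_dir: str) -> None:
    """Print trajectory files whose recorded API observations look like server failures."""
    total = affected = 0
    for path in Path(traj_dir).glob("**/*.json"):
        total += 1
        text = path.read_text(errors="ignore")
        # Crude heuristic: failed real-API calls usually leave a 500 status code or an
        # "Internal Server Error" message inside the recorded observations.
        if "500" in text and ("Internal Server Error" in text or '"error"' in text):
            affected += 1
            print(f"possible server-side failure: {path}")
    print(f"{affected}/{total} trajectory files show signs of real-API 500 errors")

if __name__ == "__main__":
    count_server_errors("data/replicated_trajectories")  # assumed output directory
```

If a large fraction of files are flagged, the lower pass rates are more likely caused by the unstable API server than by your inference settings.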
