(EAI-353): Quiz question evaluation #440

mongodben · 2024-06-20T14:18:28Z

Jira: https://jira.mongodb.org/browse/EAI-353

Changes

In the mongodb-chatbot-evaluation package:

- Quiz question data generator (i.e. gets a response from the model)
- Quiz question evaluator (i.e. checks whether the model response is correct)
- Have pipeline functions return metadata that can carry additional info.
- Add createdAt field to GeneratedData
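A minimal sketch of the evaluator step, assuming a hypothetical `GeneratedQuizAnswer` record (including a `createdAt` field) whose model response is expected to start with the answer letter, as a HELM-style prompt asks for. `extractAnswerLabel`, `evaluateQuizQuestionAnswer`, and all field names are illustrative, not the package's actual API; the returned `metadata` object only shows where extra info could travel with a pipeline result.

```typescript
// Hypothetical shapes for illustration; the real GeneratedData and
// evaluation types in mongodb-chatbot-evaluation may differ.
interface QuizQuestionData {
  questionText: string;
  answers: { label: string; answer: string }[];
  correctAnswer: string; // label of the correct choice, e.g. "B"
}

interface GeneratedQuizAnswer {
  _id: string;
  createdAt: Date; // when the model response was generated
  data: QuizQuestionData;
  modelResponse: string; // raw text returned by the model
}

interface EvalResult {
  generatedDataId: string;
  result: number; // 1 = correct, 0 = incorrect
  createdAt: Date;
  metadata: Record<string, unknown>; // room for extra, evaluator-specific info
}

/**
 * Return the leading answer letter (A-D) of a response such as "B", " b.",
 * or "(C)". Returns undefined if the response does not start with one, so
 * a word like "Because" is not mistaken for answer B.
 */
function extractAnswerLabel(response: string): string | undefined {
  const match = response.match(/^\s*\(?([A-Da-d])(?![A-Za-z])/);
  return match ? match[1].toUpperCase() : undefined;
}

function evaluateQuizQuestionAnswer(generated: GeneratedQuizAnswer): EvalResult {
  const predicted = extractAnswerLabel(generated.modelResponse);
  const expected = generated.data.correctAnswer.toUpperCase();
  return {
    generatedDataId: generated._id,
    result: predicted === expected ? 1 : 0,
    createdAt: new Date(),
    metadata: { predicted: predicted ?? null, expected },
  };
}
```

Scoring as 1/0 keeps the result easy to average into an accuracy figure across a quiz set.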

New model-eval package for running evals on LLMs, including:

- Create package w/ package config (based on chatbot-eval-mongodb-public)
- Config to run model quiz question evaluations
- Support calling multiple models through Radiant
  - Currently only Azure OpenAI models. Hopefully more soon...
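The multi-model setup can be pictured as running one quiz set against a list of model configs and reporting accuracy per model. This is a sketch under assumptions: `ModelConfig`, `QuizItem`, `runQuizEvalForModels`, and the `deployment` field are hypothetical, and the actual Radiant call is left to an injected `answer` function because its client API is not shown in this PR.

```typescript
// Hypothetical config shape; the real model-eval config may differ.
interface ModelConfig {
  label: string; // name used in reports
  deployment: string; // e.g. an Azure OpenAI deployment name (assumption)
}

interface QuizItem {
  prompt: string; // fully formatted multiple choice prompt
  correctAnswer: string; // "A" | "B" | "C" | "D"
}

// How to get a completion from a model. In model-eval this would wrap the
// Radiant call; it is injected here because that client is not shown.
type AnswerFn = (model: ModelConfig, prompt: string) => Promise<string>;

/** Run every quiz item against every model, sequentially, and return accuracy per model label. */
async function runQuizEvalForModels(
  models: ModelConfig[],
  items: QuizItem[],
  answer: AnswerFn
): Promise<Map<string, number>> {
  const accuracy = new Map<string, number>();
  for (const model of models) {
    let correct = 0;
    for (const item of items) {
      const response = await answer(model, item.prompt);
      // Simplification: score on the first non-space character only.
      if (response.trim().charAt(0).toUpperCase() === item.correctAnswer) {
        correct++;
      }
    }
    accuracy.set(model.label, items.length > 0 ? correct / items.length : 0);
  }
  return accuracy;
}
```

Injecting the model call keeps the loop testable with a stub, and adding a non-Azure model becomes a matter of passing a different `answer` function. Scoring on the first character is a simplification; a stricter parser would reject responses such as "Because..." that merely start with a valid letter.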

In the chatbot-eval-mongodb-public package:

- Support new pipeline function format.

Notes

packages/mongodb-chatbot-evaluation/package.json

nlarew

LGTM w/ a few non-blocking comments

packages/model-eval/.env.example

packages/model-eval/src/radiantModels.ts

packages/mongodb-chatbot-evaluation/src/test/mockGenerateDataFunc.ts

nlarew · 2024-07-08T19:16:22Z

packages/mongodb-chatbot-evaluation/src/generate/generateLlmQuizQuestionAnswer.ts

+  This can be useful for evaluating how an LLM performs on the subject matter of the multiple choice questions.
+
+  The prompt is based on this blog post from Hugging Face: https://huggingface.co/blog/open-llm-leaderboard-mmlu
+  It follows the [HELM prompt format](https://huggingface.co/blog/open-llm-leaderboard-mmlu#mmlu-comes-in-all-shapes-and-sizes-looking-at-the-prompts).


Would it be worth splitting this out into a reusable makeHelmPrompt() function w/ accompanying tests? Not necessary for this PR but might be useful down the line + as documentation of the format.

mongodben (Jul 9, 2024): HELM is just for quiz-style questions; the current makeQuizQuestionPrompt() already does what you're describing. I can refactor the comments a bit to make this clearer, and also export this function and test it.
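For reference, a prompt in the HELM layout from the Hugging Face post consists of a subject header, optional few-shot examples with their answers, and the target question ending in a bare `Answer:`. The sketch below is a hypothetical `makeHelmStyleQuizPrompt()`, not the PR's `makeQuizQuestionPrompt()`; its parameters and the `Question:` prefix are assumptions based on the HELM example in the post, so check the exact wording and whitespace against it before relying on this.

```typescript
// Hypothetical question shape; the PR's actual quiz question type may differ.
interface QuizQuestion {
  questionText: string;
  answers: string[]; // in display order; labeled A, B, C, ...
  correctAnswer?: string; // label such as "B"; only needed for few-shot examples
}

const LABELS = "ABCDEFGHIJ";

/** Format one question block; few-shot examples include their answer letter. */
function formatQuestion(q: QuizQuestion, includeAnswer: boolean): string {
  const lines = [`Question: ${q.questionText}`];
  q.answers.forEach((answer, i) => lines.push(`${LABELS[i]}. ${answer}`));
  lines.push(includeAnswer && q.correctAnswer ? `Answer: ${q.correctAnswer}` : "Answer:");
  return lines.join("\n");
}

/**
 * Build a HELM-style multiple choice prompt: subject header, optional few-shot
 * examples with answers, then the target question ending in "Answer:".
 */
function makeHelmStyleQuizPrompt(
  subject: string,
  question: QuizQuestion,
  examples: QuizQuestion[] = []
): string {
  for (const q of [...examples, question]) {
    if (q.answers.length > LABELS.length) throw new Error("Too many answer choices");
  }
  const header = `The following are multiple choice questions (with answers) about ${subject}.`;
  const blocks = [
    header,
    ...examples.map((ex) => formatQuestion(ex, true)),
    formatQuestion(question, false),
  ];
  // Blank line between the header and each question block.
  return blocks.join("\n\n");
}
```

Ending the prompt with a bare `Answer:` nudges the model to reply with a single letter, which keeps the evaluator's parsing simple.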

quiz question data generator and evaluator

59f0f2b

mongodben changed the title from "(EAI-353): Quiz question evaluatin" to "(EAI-353): Quiz question evaluation" on Jun 20, 2024

mongodben added 6 commits June 20, 2024 14:56

start implementation

0e1c773

make separate model eval pkg

8ea8911

stub out e2e

8310db9

working e2e

16e28a7

lock fix

fc0f39c

fix build errs

eab183d

mongodben marked this pull request as ready for review July 2, 2024 16:56

nlarew reviewed Jul 8, 2024


packages/mongodb-chatbot-evaluation/package.json (outdated)

mongodben commented Jul 8, 2024


packages/mongodb-chatbot-evaluation/package.json (outdated)

Apply suggestions from code review

2b0b8f6

nlarew approved these changes Jul 8, 2024


mongodben added 2 commits July 9, 2024 15:17

implement NL feedback

0df4456

add missing test

2915f15

mongodben merged commit 364b01f into main Jul 9, 2024
1 check passed

mongodben deleted the EAI-353 branch July 9, 2024 19:38
