Releases: Azure-Samples/ai-rag-chat-evaluator

2024-06-05: Update to new AI Chat Protocol, increase flexibility

05 Jun 18:28
3aa7f95

This release updates the evaluator tool to assume that the chat backend conforms to the new Microsoft AI Chat Protocol, and it also adds two new properties to the config JSON so the tool can still be used with backends that follow the older protocol or other protocols.

Just add these fields and customize the JMESPath expressions as needed:

    "target_response_answer_jmespath": "choices[0].message.content",
    "target_response_context_jmespath": "choices[0].context.data_points.text"

What's Changed

  • Bump promptflow-evals from 0.2.0.dev0 to 0.3.0 by @dependabot in #88
  • Avoid rate-limiting, improve --changed by @pamelafox in #91
  • Update to new Protocol response format, use JMESPath expressions by @pamelafox in #92

Full Changelog: 2024-05-13...2024-06-05

2024-05-13: Updated underlying SDK, new metrics

13 May 20:16
1bbec13

This release ports the tool to the promptflow-evals SDK for the evaluation functionality, since the evaluate functionality in azure-ai-generative is being deprecated. The Q&A generation still uses azure-ai-generative for now.

Some user-facing changes:

  • I renamed the custom metrics to "mygroundedness", "myrelevance", "mycoherence" to make it clear that they're not the built-in metrics. If you previously generated custom metrics, rename the keys in evalresults.json to the keys above and rename the requested metrics in your config.json file (see the migration sketch after this list).
  • I added more built-in metrics from the promptflow-evals SDK: fluency, similarity, f1score.
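If you have existing results to migrate, a rough sketch like the following could do the renaming. The old key names ("groundedness", "relevance", "coherence"), the file path, and the file layout (a list of per-question result dicts) are all assumptions here, so adjust them to match your actual evalresults.json:

    # Rough migration sketch; key names, path, and file layout are assumptions.
    import json
    from pathlib import Path

    RENAMES = {"groundedness": "mygroundedness",
               "relevance": "myrelevance",
               "coherence": "mycoherence"}

    path = Path("example_results/baseline/evalresults.json")  # hypothetical path
    results = json.loads(path.read_text())
    for row in results:
        for old, new in RENAMES.items():
            if old in row:
                row[new] = row.pop(old)
    path.write_text(json.dumps(results, indent=4))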

What's Changed

New Contributors

Full Changelog: 2024-03-05...2024-05-13

2024-03-05: Diff tool for single run

05 Mar 19:51
e94223b

This release makes the diff tool more flexible: you can now specify a single directory, and it will diff that run against the ground truth answers.

For example:

python -m review_tools diff example_results/baseline/

What's Changed

Full Changelog: 2024-03-04...2024-03-05

2024-03-04: Evaluate "I don't know" situations

04 Mar 23:41
25b06de

The tools now support evaluating your app's ability to say "I don't know". See README:
https://github.com/Azure-Samples/ai-rag-chat-evaluator?tab=readme-ov-file#measuring-apps-ability-to-say-i-dont-know

There's also a new metric, citationmatch, which checks whether an answer's citations contain the original citation from the ground truth answer.
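As a rough illustration of the idea (not necessarily the tool's implementation), a citation-match check can pull the bracketed citations out of both answers and test whether the ground truth's citation appears among the app's; the [bracketed] citation format here is an assumption:

    # Rough illustration only; assumes citations appear as [bracketed] source names.
    import re

    def citation_match(ground_truth: str, answer: str) -> bool:
        truth_citations = set(re.findall(r"\[([^\]]+)\]", ground_truth))
        answer_citations = set(re.findall(r"\[([^\]]+)\]", answer))
        return bool(truth_citations) and truth_citations <= answer_citations

    print(citation_match(
        "Employees get 10 days off [benefits.pdf#page=3].",
        "You get 10 days of PTO [benefits.pdf#page=3][handbook.pdf].",
    ))  # True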

What's Changed

  • Handle failures with numeric ratings by @pamelafox in #53
  • Added citationmatch metric and tools for evaluating "I don't know" answers by @pamelafox in #54

Full Changelog: 2024-02-15...2024-03-04

2024-02-15: Upgrade azure-ai-generative SDK, custom prompt metrics

16 Feb 06:32
a7e5717

This week's release upgraded the azure-ai-generative SDK version; that upgrade introduced a regression, which has now been fixed.

The evaluator tool also now has the ability to run custom prompt metrics, which is particularly helpful if you need to localize the built-in prompts. See documentation here:
https://github.com/Azure-Samples/ai-rag-chat-evaluator/tree/main?tab=readme-ov-file#custom-metrics
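As a generic illustration of the idea (this is not the repo's actual prompt-metric format, which the documentation above describes), a localized prompt metric amounts to supplying your own grading prompt in the target language and asking the GPT model for a numeric score:

    # Generic illustration only; see the documentation above for the real format.
    # The Spanish relevance prompt below is a made-up example.
    RELEVANCE_PROMPT_ES = (
        "Eres un evaluador. Dada la pregunta y la respuesta, califica del 1 al 5 "
        "la relevancia de la respuesta para la pregunta. Responde solo con el numero.\n\n"
        "Pregunta: {question}\nRespuesta: {answer}\nCalificacion:"
    )

    def build_metric_messages(question: str, answer: str) -> list[dict]:
        # Messages you could send to a chat completions model to obtain the score.
        return [{"role": "user",
                 "content": RELEVANCE_PROMPT_ES.format(question=question, answer=answer)}]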

What's Changed

New Contributors

Full Changelog: https://github.com/Azure-Samples/ai-rag-chat-evaluator/commits/2024-02-15