Releases: Azure-Samples/ai-rag-chat-evaluator
2024-06-05: Update to new AI Chat Protocol, increase flexibility
This release updates the evaluator tool to assume that the chat backend conforms to the new Microsoft AI Chat Protocol, but it also adds two new properties to the config JSON so that backends using the older protocol, or other protocols entirely, can still be evaluated.
Just add these fields and customize the JMESPath expressions as needed:
"target_response_answer_jmespath": "choices[0].message.content",
"target_response_context_jmespath": "choices[0].context.data_points.text"
What's Changed
- Bump promptflow-evals from 0.2.0.dev0 to 0.3.0 by @dependabot in #88
- Avoid rate-limiting, improve --changed by @pamelafox in #91
- Update to new Protocol response format, use JMESPath expressions by @pamelafox in #92
Full Changelog: 2024-05-13...2024-06-05
2024-05-13: Updated underlying SDK, new metrics
This release ports the tool to the promptflow-evals SDK for the evaluation functionality, since the evaluate functionality in azure-ai-generative is being deprecated. The Q&A generation still uses azure-ai-generative for now.
Some user-facing changes:
- I renamed the custom metrics to "mygroundedness", "myrelevance", and "mycoherence" to make it clear that they're not the built-in metrics. If you previously generated custom metrics, you'll want to rename the keys in evalresults.json to the keys above and rename the requested metrics in your config.json file (see the sketch after this list).
- I added more built-in metrics from the promptflow-evals SDK: fluency, similarity, f1score.
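For the rename itself, here's an illustrative migration sketch; the old metric names, the file path, and the results-file layout are assumptions, so adjust them to match your own files:

```python
# Illustrative migration sketch (not shipped with the tool): rename previously
# generated custom-metric keys in a results file to the new "my*" names.
import json
from pathlib import Path

RENAMES = {
    "groundedness": "mygroundedness",  # assumed old name -> new name
    "relevance": "myrelevance",
    "coherence": "mycoherence",
}

path = Path("example_results/experiment1/evalresults.json")  # hypothetical path
results = json.loads(path.read_text(encoding="utf-8"))

# Assumes the file holds a list of per-question dicts keyed by metric name.
for row in results:
    for old, new in RENAMES.items():
        if old in row:
            row[new] = row.pop(old)

path.write_text(json.dumps(results, indent=2), encoding="utf-8")
```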
What's Changed
- Add citation_match metric, changed argument, answer reviewer by @pamelafox in #57
- Fixing typo in readme by @codemillmatt in #59
- Improving OpenAI support by @pamelafox in #62
- Pin promptflow by @pamelafox in #65
- Bump azure-ai-generative[evaluate] from 1.0.0b7 to 1.0.0b8 by @dependabot in #64
- Better debugging for evaluate command by @pamelafox in #69
- Adding azd GitHub actions workflow by @pamelafox in #70
- Fixes for running on GH by @pamelafox in #71
- Add optional target URL to evaluate command and CI to test evaluate by @pamelafox in #72
- Add more OSes to matrix by @pamelafox in #73
- Fix target URL name by @pamelafox in #74
- Add keyring setup to CI by @pamelafox in #75
- Remove unused infrastructure, disable keyed access for OpenAI service by @pamelafox in #83
- Port to new promptflow-evals SDK by @pamelafox in #85
New Contributors
- @codemillmatt made their first contribution in #59
Full Changelog: 2024-03-05...2024-05-13
2024-03-05: Diff tool for single run
This release makes the diff tool more flexible: you can now specify a single directory, and the tool will diff that run's answers against the ground truth answers. For example:
python -m review_tools diff example_results/baseline/
What's Changed
- Handle single directory for diff tool by @pamelafox in #55
Full Changelog: 2024-03-04...2024-03-05
2024-03-04: Evaluate "I don't know" situations
The tools now support evaluating your app's ability to say "I don't know". See README:
https://github.com/Azure-Samples/ai-rag-chat-evaluator?tab=readme-ov-file#measuring-apps-ability-to-say-i-dont-know
There's also a new metric, citationmatch, which checks whether an answer's citations contain the original citation from the ground truth answer.
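As a rough illustration of the idea (not the tool's actual implementation), a citation-match style check can be written as a regex over bracketed citations; the [filename.pdf] citation format is an assumption here:

```python
# Sketch of a citation-match style check, assuming citations appear in square
# brackets such as [benefits.pdf]. Not the tool's actual implementation.
import re

CITATION_PATTERN = re.compile(r"\[([^\]]+\.\w+)\]")

def citation_match(ground_truth: str, answer: str) -> bool:
    """True if every citation in the ground truth also appears in the answer."""
    truth_citations = set(CITATION_PATTERN.findall(ground_truth))
    answer_citations = set(CITATION_PATTERN.findall(answer))
    return truth_citations.issubset(answer_citations)

print(citation_match(
    "Eye exams are covered [benefits.pdf].",
    "Yes, eye exams are covered [benefits.pdf][overview.pdf].",
))  # -> True
```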
What's Changed
- Handle failures with numeric ratings by @pamelafox in #53
- Added citationmatch metric and tools for evaluating "I don't know" answers by @pamelafox in #54
Full Changelog: 2024-02-15...2024-03-04
2024-02-15: Upgrade azure-ai-generative SDK, custom prompt metrics
This week's release upgrades the azure-ai-generative SDK version; the new SDK version introduced a regression in the tool, which is now fixed.
The evaluator tool can also now run custom prompt metrics, which is particularly helpful if you need to localize the built-in prompts. See the documentation here:
https://github.com/Azure-Samples/ai-rag-chat-evaluator/tree/main?tab=readme-ov-file#custom-metrics
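The linked documentation covers how the tool itself is configured; purely as a sketch of the underlying idea, a prompt-based metric sends a grading prompt to the model and parses a numeric score, which is why a localized prompt can be swapped in. The client setup, environment variable names, and deployment name below are all assumptions:

```python
# Conceptual sketch of a prompt-based metric, not the tool's own mechanism.
# Environment variable names and the deployment name are assumptions.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-15-preview",
)

GRADING_PROMPT = (
    "Rate from 1 to 5 how well the ANSWER is grounded in the CONTEXT. "
    "Reply with only the number.\n\nCONTEXT:\n{context}\n\nANSWER:\n{answer}"
)

def my_groundedness(context: str, answer: str, deployment: str = "gpt-4-eval") -> int:
    # The grading prompt could be rewritten in any language, which is the
    # appeal of custom prompt metrics for localization.
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": GRADING_PROMPT.format(context=context, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```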
What's Changed
- Bump actions/setup-python from 4 to 5 by @dependabot in #1
- Fix env var to match service setup by @pamelafox in #8
- Fix keys by @pamelafox in #14
- Support for making a directory by @pamelafox in #15
- Check for git directory before pre-commit by @pamelafox in #16
- Remove pre-commit from devcontainer.json by @pamelafox in #21
- Readme improvements by @pamelafox in #20
- Update requirements by @pamelafox in #22
- Pin AzureML metrics by @pamelafox in #25
- Address feedback from Chris by @pamelafox in #28
- Rename parameters json by @pamelafox in #30
- Better error messages, tests, and encoding by @pamelafox in #34
- Add english-only notes to readme by @pamelafox in #36
- Make numquestions an optional argument by @pamelafox in #39
- Fix numquestions correctly by @pamelafox in #40
- Add test call to GPT deployment by @pamelafox in #41
- Upgrade sdk, add latency by @pamelafox in #45
- Make windows-friendly adjustments to README by @pamelafox in #46
- Support/prioritize local prompt metrics by @pamelafox in #50
- Fix Azure OpenAI bug with API key by @sofyanajridi in #51
- Fix data mapping to match new evaluate SDK expectations by @pamelafox in #52
New Contributors
- @dependabot made their first contribution in #1
- @pamelafox made their first contribution in #8
- @sofyanajridi made their first contribution in #51
Full Changelog: https://github.com/Azure-Samples/ai-rag-chat-evaluator/commits/2024-02-15