Releases: open-compass/opencompass
0.3.9
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.9!
🌟 Highlights
✨ This version introduces a number of new features and improvements that enhance the user experience and expand the capabilities of OpenCompass. Notable changes include support for G-Pass@k and LiveMathBench, as well as the introduction of the Bradley-Terry subjective evaluation method.
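For readers unfamiliar with G-Pass@k, the following is a minimal sketch of the per-question score as we understand its published definition: given n generations of which c are correct, it is the probability that at least ⌈τ·k⌉ of k generations drawn without replacement are correct. The function name `g_pass_at_k`, its parameters, and the clamping of the threshold to at least one correct sample are assumptions made for illustration; this is not taken from the OpenCompass implementation in #1772, which may differ.

```python
from math import ceil, comb


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Illustrative per-question G-Pass@k (hypothetical helper, not OpenCompass code).

    n:   total generations sampled for the question
    c:   number of those generations that are correct
    k:   number of generations drawn without replacement
    tau: fraction of the k draws that must be correct (0 < tau <= 1)
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Assumption: require at least one correct draw, so a very small tau
    # reduces to the usual pass@k instead of a trivial score of 1.
    min_correct = max(1, ceil(tau * k))
    total = comb(n, k)
    # Hypergeometric tail: j correct draws out of c, k - j incorrect out of n - c.
    hits = sum(
        comb(c, j) * comb(n - c, k - j)
        for j in range(min_correct, min(c, k) + 1)
    )
    return hits / total
```

With n=4, c=2, k=2, setting tau=1.0 requires both draws to be correct (1/6), while tau=0.5 requires only one and matches pass@2 (5/6). A benchmark-level score would average this value over questions.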
🚀 New Features
-🆕 Support for G-Pass@k and LiveMathBench metrics to better evaluate model performance. (#1772)
-🆕 Theorem QA 0shot CoT configuration has been added for more comprehensive evaluation scenarios. (#1783)
-🆕 A customizable tokenizer for RULER offers greater flexibility in processing inputs. (#1731)
-🆕 Added LiveStemBench Dataset to enrich our collection of datasets. (#1794)
-🆕 Integration of JudgeLLM into o1 evaluation for improved assessment accuracy. (#1795)
-🆕 Implementation of the Bradley-Terry subjective evaluation method on wildbench, alpacaeval, and compassarena datasets. (#1791)
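As background on the Bradley-Terry method named in the last item, here is a minimal sketch of how pairwise judge outcomes can be turned into per-model strengths. It uses the classic minorization-maximization update for the Bradley-Terry model; the function name `bradley_terry`, the `wins` matrix layout, and the fixed iteration count are assumptions for illustration. The OpenCompass implementation in #1791 is not shown here and may use a different fitting procedure.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] is the number of comparisons in which model i beat model j.
    Returns a list of strengths that sums to 1 (hypothetical helper).
    """
    m = len(wins)
    if sum(sum(row) for row in wins) == 0:
        raise ValueError("at least one comparison with a winner is required")
    p = [1.0 / m] * m
    for _ in range(iters):
        new_p = []
        for i in range(m):
            total_wins = sum(wins[i][j] for j in range(m) if j != i)
            # Only pairs that were actually compared contribute to the update.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(m)
                if j != i and wins[i][j] + wins[j][i] > 0
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        p = [x / norm for x in new_p]
    return p
```

Under this model the probability that model i beats model j is p[i] / (p[i] + p[j]). For example, if model A beats model B in three of four comparisons, `bradley_terry([[0, 3], [1, 0]])` gives A a strength of 0.75.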
📖 Documentation
-📚 Updated OC academic content to the most recent information as of December 2024. (#1771)
🐛 Bug Fixes
-🔧 Fixed an order error that was causing issues with sequence handling. (#1767)
-🔧 Resolved an issue where the lark report was returning None. (#1769)
-🔧 Corrected the path for saving Local Runner parameters. (#1768)
-🔧 Amended the summarizer abbreviation for models to ensure proper identification. (#1789)
-🔧 Fixed output_path errors to improve file handling reliability. (#1798)
⚙ Enhancements and Refactors
-💪 Fullbench testcase has been integrated into the CI pipeline. (#1766)
-💪 Volc status exception handling has been updated for more robust responses. (#1780)
-💪 Removed daily step retry mechanism and updated PR score calculation for efficiency. (#1782)
-💪 Deploy Python version has been updated to the latest stable release. (#1784)
-💪 PyPI deploy workflow has been refined for smoother deployments. (#1786)
Thank you for being part of the OpenCompass community! Your support and contributions make each release possible.
0.3.8
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.8!
🎉 Added Features:
-🆕 DLC runner Lark report functionality has been introduced to improve reporting mechanisms. ( #1735 )
-🆕 Chinese SimpleQA configuration has been added to extend language support. ( #1697 )
-🆕 Addition of OC academic 2412 evaluation example ( #1750 )
📖 Documentation
-📚 Supplemented KOR-BENCH readme documentation for better user guidance. ( #1734 )
-📚 Updated README files for various datasets including Korbench to ensure up-to-date information. ( #1737 )
🐛 Bug Fixes
-🔧 Corrected an error in the subjective default summarizer for more accurate results. ( #1740 )
-🔧 Adjusted the max_out_len parameter for ChineseSimpleQA to fix related issues. ( #1757 )
-🔧 Resolved a problem with the transfer of the vllm max_seq_len parameter for consistent behavior. ( #1745 )
⚙ Enhancements and Refactors
-💪 Updated Manifest file to reflect the latest changes in the project. ( #1738 )
-💪 Modified Compassarena metric for enhanced performance measurement. ( #1749 )
-💪 Improved dataset configurations by removing max_out_len where not applicable. ( #1754 )
-💪 Upgraded requirement and deepseek configurations for better compatibility. ( #1764 )
🎉 Welcome New Contributors
We are pleased to welcome @OpenStellarTeam, who contributed to the addition of Chinese SimpleQA config. ( #1697 )
For a comprehensive overview of all changes, please refer to the Full Changelog.
Thank you for being part of the OpenCompass community! Your support and contributions make each release possible.
0.3.7
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.7!
🚀 New Features
- 🆕 Added support for new error code handling, improving system resilience. (#1702)
- 🆕 Introduced the P-MMEval feature for advanced model evaluations. (#1714)
- 🆕 Added LiveMathBench support for dynamic mathematical benchmarking. (#1727)
- 🆕 Included the OpenAI SimpleQA dataset to expand our question answering capabilities. (#1720)
📖 Documentation
- 📚 Updated configurations and documentation to reflect the latest changes, ensuring a smooth user experience. (#1704, #1717)
🐛 Bug Fixes
- 🔧 Resolved issues in output sequence generation under Turbomind model. (#1707)
- 🔧 Corrected configuration errors in pmmeval_gen to ensure proper functionality. (#1719)
⚙ Enhancements and Refactors
- 💪 Enhanced support for Arc Prize Public Evaluation. (#1690)
- 💪 Increased max_out_len parameters for various datasets to accommodate longer sequences. (#1726)
- 💪 Incorporated Korbench and updated Fullbench to provide more comprehensive benchmarking options. (#1713, #1712)
- 💪 Streamlined CI pipeline by updating the torch version and adding more datasets into daily test cases. (#1701)
🎉 Welcome New Contributors
- 👏 A warm welcome to @epsilondylan and @wanyu2018umac, who have made their first contributions by adding the Korbench dataset and introducing the P-MMEval feature respectively! (#1713, #1714)
For a complete overview of all changes, please refer to the full changelog: 0.3.6...0.3.7
0.3.6
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.6!
🌟 Highlights
✨ This release brings several updates and new features that enhance the functionality and performance of OpenCompass. Notable additions include support for long context evaluations, the introduction of the BABILong dataset, and enhancements to the MuSR dataset. We have also welcomed new contributors to our community, which we are excited to introduce.
🚀 New Features
🔥 Added long context evaluation for base models, expanding the scope of model assessments.
🔥 Introduced the BABILong dataset, enriching the resources available for research and development.
🔥 Added MuSR dataset evaluation, which evaluates language models on multistep soft reasoning tasks.
📖 Documentation
📚 Updated documentation to reflect the latest changes and features, ensuring that users can easily integrate these updates into their workflows.
🐛 Bug Fixes
🛠 Fixed issues with first_option_postprocess to improve reliability.
🛠 Addressed bugs in the PR testing process to ensure smoother contributions from the community.
⚙ Enhancements and Refactors
🔧 Implemented auto-download for FollowBench, streamlining the setup process for new users.
🔧 Refined the CI/CD pipeline, including daily tests and baseline updates, to maintain high standards of quality and performance.
🎉 Welcome New Contributors
👏 We are delighted to welcome three new contributors who have made valuable contributions to this release:
- @MCplayerFromPRC for pushing InternTrain evaluation differences.
- @DespairL for adding single LoRA adapter support for vLLM inference.
- @abrohamLee for contributing MuSR Dataset Evaluation.
We hope you enjoy this new release and find it useful for your projects. Your feedback is always welcome and helps us improve OpenCompass continuously. Thank you for being part of our community! 🌟
Full Changelog: 0.3.5...0.3.6
0.3.5
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.5!
🌟 Highlights
- 🚀 Introduction of two new datasets: CMO&AIME, expanding our evaluation capabilities.
- 📖 Several updates to our documentation, ensuring clearer guidance for all users.
- ⚙ Several enhancements and refactoring efforts to make our codebase more robust and maintainable.
🚀 New Features
- 🆕 Added support for the CMO&AIME datasets, broadening the scope of models we can evaluate. (#1610)
- 🆕 Introduced the CompassArenaSubjectiveBench, a new benchmark for subjective evaluations. (#1645)
- 🆕 Added configurations for the lmdeploy DeepSeek model, enhancing compatibility with cutting-edge technologies. (#1656)
📖 Documentation
- 📚 Updated the documentation to reflect the latest changes and improvements, making it easier than ever to navigate and understand. (#1655)
🐛 Bug Fixes
- 🔧 Fixed issues with the ruler_16k_gen component, ensuring more accurate and reliable results. (#1643)
- 🔧 Resolved an error in the get_loglikelihood function when using lmdeploy as the accelerator. (#1659)
- 🔧 Addressed problems with automatic downloads for certain datasets, streamlining the user experience. (#1652)
⚙ Enhancements and Refactors
- 💪 Enhanced the summarizer configurations for models, improving the efficiency and effectiveness of summarization tasks. (#1600)
- 💪 Added new model configurations, keeping up with the latest advancements in machine learning. (#1653)
- 💪 Updated the WildBench maximum sequence length, allowing for better handling of longer input sequences. (#1648)
- 💪 Updated the Needlebench OSS path, ensuring smoother data access and processing. (#1651)
- 💪 Improved the mmmlu_lite dataloader, optimizing data loading processes. (#1658)
🎉 Welcome New Contributors
- 👏 A warm welcome to @jnanliu, who has made their first contribution by adding the CMO&AIME datasets! (#1610)
For a complete overview of all changes, please refer to the full changelog: 0.3.4...0.3.5
0.3.4
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.4!
🎉 OpenCompass v0.3.4 brings major enhancements including new benchmarks, improved documentation, and numerous bug fixes.
🌈 Notable features include support for new datasets and the integration of lmdeploy pipeline API.
🔧 Support for New Datasets:
- Addition of GaoKaoMath Dataset for Evaluation.
- Support for MMMLU & MMMLU-lite Benchmark.
- Integration of Judgerbench and reorganization of subeval.
- Support for LiveCodeBench.
📝 Output Format Enhancements:
- Support for printing and saving results as markdown format tables.
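To illustrate what a markdown-format results table looks like, here is a minimal sketch that renders a header and rows as a pipe-delimited table. The helper name `to_markdown_table` and the placeholder values are assumptions for illustration only; they do not reflect the actual OpenCompass summarizer code or its options.

```python
def to_markdown_table(headers, rows):
    """Render headers and rows as a markdown table string (hypothetical helper)."""

    def fmt(cells):
        return "| " + " | ".join(str(cell) for cell in cells) + " |"

    lines = [fmt(headers), fmt(["---"] * len(headers))]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values, not real evaluation results.
    table = to_markdown_table(
        ["dataset", "metric", "demo_model"],
        [["demo_dataset", "accuracy", 12.5]],
    )
    print(table)
```

The resulting string can be printed to the console or written to a `.md` file unchanged, which is the appeal of the format for sharing results.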
🔧 Pipeline and Integration Improvements:
- Integration of lmdeploy pipeline API.
- Update of TurboMindModel through integration of lmdeploy pipeline API.
- Removal of prefix bos_token from messages when using lmdeploy as the accelerator.
🛠️ Miscellaneous Enhancements:
- Updates to the common summarizer regex extraction.
- Internal humaneval postprocess addition and updates.
📖 Documentation Updates
🐛 Bug Fixes
🎉 Welcome New Contributors
👋 New Contributors Joined the Team:
@BobTsang1995 - Contributed support for MMMLU & MMMLU-lite Benchmark.
@noemotiovon - Provided NPU support fixes.
@changlan - Fixed RULER datasets.
@BIGWangYuDong - Added support for printing and saving results as markdown format tables.
Thank you to all contributors who have made this release possible. For a complete list of changes, please see the full changelog linked below.
Full Changelog: 0.3.3...0.3.4
0.3.3
🌟 OpenCompass v0.3.3 Release Log
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.3!
🚀 New Features
- 🔧 Added support for the SciCode summarizer configuration.
- 🛠 Introduced support for internal Followbench.
- 🔧 Updated models and configurations for MathBench & WikiBench under FullBench.
- 🛠 Enhanced support for OpenAI O1 models and Qwen2.5 Instruct.
- 🔧 Included a postprocess function for custom models.
- 🛠 Added InternTrain feature for broader model training scenarios.
📖 Documentation
- 📚 Updated the README with the latest information on how to use OpenCompass effectively.
🐛 Bug Fixes
- 🔧 Fixed issues with the link-check workflow and wildbench.
- 🛠 Resolved errors in partitioning and corrected typos throughout the codebase.
- 🔧 Addressed compatibility issues with lmdeploy interface type changes.
- 🛠 Fixed the followbench dataset configuration and token settings.
⚙ Enhancements and Refactors
- 🛠 Enhanced support for verbose output in OpenAI API interactions.
- 🔧 Updated maximum output length configurations for multiple models.
- 🛠 Improved handling of the "begin section" in meta_template for better parsing.
- 🔧 Added a common summarizer for qabench and expanded test coverage for various models.
🎉 Welcome New Contributors
👋 We'd like to extend a warm welcome to our new contributors who have made their first contributions to OpenCompass:
- @x54-729 introduced InternTrain.
- @chuanyangjin helped correct typos.
- @cuauty added support for reasoning from BaiLing LLM.
Thank you to all our contributors for making this release possible!
Full Changelog: 0.3.2.post1...0.3.3
0.3.2.post1
What's Changed
- [Fix]Init import fix by @MaiziXiao in #1500
- [Bump] Bump version to 0.3.2.post1 by @MaiziXiao in #1502
Full Changelog: 0.3.2...0.3.2.post1
0.3.2
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.2!
🚀 New Features
- 🛠 Added extra_body support for OpenAISDK and introduced proxy URL support when connecting to OpenAI's API.
- 🗂 Included auto-download functionality for Mmlu-pro, Needlebench, Longbench and other datasets.
- 🤝 Integrated support for the Rendu API.
- 🧪 Added a model postprocess function.
📖 Documentation
- 📜 Updated the README file for better clarity and guidance.
🐛 Bug Fixes
- 🛠 Fixed CLI evaluation for multiple models.
- 🛠 Updated requirements to resolve dependency issues.
- 🛠 Corrected configurations for the Llama model series.
- 🛠 Addressed bad cases and added environment information to improve testing.
⚙ Enhancements and Refactors
- 🛠 Made OPENAI_API_BASE compatible with OpenAI's default environment settings.
- 🛠 Optimized SciCode for improved performance.
- 🛠 Added an api_key attribute to TurboMindAPIModel.
- 🛠 Implemented fixes and improvements to the CI test environment, including baselines for vllm.
🎉 Welcome New Contributors
- 👋 @cpa2001 contributed with the addition of icl_sliding_k_retriever.py and updates to __init__.py.
- 👋 @gyin94 made the OPENAI_API_BASE compatible with OpenAI's default environment.
- 👋 @chengyingshe added an api_key attribute to TurboMindAPIModel.
- 👋 @yanzeyu supported the integration of Rendu API.
Full Changelog: 0.3.1...0.3.2
OpenCompass v0.3.1
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.1!
🌟 Highlights
- 🚀 Added pip installation support and updated the README and evaluation demo.
- 🐛 Fixed various dataset loading issues.
- ⚙️ Enhanced auto-download features for datasets.
🚀 New Features
- 🆕 Introduced support for Ruler datasets.
- 🆕 Enhanced model compatibility.
- 🆕 Improved dataset handling, with auto-download support for various datasets.
📖 Documentation
- 📚 Updated README to reflect the latest changes.
- 📚 Improved documentation for dataset loading procedures.
🐛 Bug Fixes
- 🐞 Resolved modelscope dataset load issues.
- 🐞 Corrected evaluation scores for the Lawbench dataset.
- 🐞 Fixed dataset bugs for CommonsenseQA and Longbench.
⚙ Enhancements and Refactors
- 🔧 Retained first and last halves of prompts to avoid max_seq_len issues.
- 🔧 Updated Compassbench to v1.3.
- 🔧 Switched to Python runner for single GPU operations.
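The prompt-retention change in the first item above can be pictured with a short sketch: when a tokenized prompt exceeds max_seq_len, the middle is dropped and the first and last halves are kept. The function name `truncate_middle` and the choice to give the extra token to the tail when max_seq_len is odd are assumptions for illustration; the actual change in #1373 may split or mark the prompt differently.

```python
def truncate_middle(token_ids, max_seq_len):
    """Keep the head and tail of a token sequence, dropping the middle.

    Hypothetical helper: returns the input unchanged when it already fits.
    """
    token_ids = list(token_ids)
    if len(token_ids) <= max_seq_len:
        return token_ids
    head_len = max_seq_len // 2
    tail_len = max_seq_len - head_len  # tail gets the extra token if odd
    head = token_ids[:head_len]
    tail = token_ids[len(token_ids) - tail_len:]
    return head + tail
```

For instance, truncating the ten tokens 0 through 9 to a budget of four keeps `[0, 1, 8, 9]`. Keeping both ends preserves the instruction at the start of a prompt and the question at the end, which is usually where the task-critical text sits.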
🎉 Welcome New Contributors
- 🙌 @Yunnglin for fixing modelscope dataset load problem.
- 🙌 @changyeyu for addressing max_seq_len issues with prompt handling.
- 🙌 @seetimee for updates to openai_api.py.
- 🙌 @HariSeldon0 for adding the scicode dataset.
What's Changed
- [Fix] Fix modelscope dataset load problem by @Yunnglin in #1406
- [Fix] the issue where scores are negative in the Lawbench dataset evaluation(#1402) by @yaoyingyy in #1403
- [Doc] Update README by @tonysy in #1404
- Retain first and last halves of prompts to avoid max_seq_len issues by @changyeyu in #1373
- [UPDATE] Compassbench v1.3 by @MaiziXiao in #1396
- [Fix] longbench dataset load fix by @MaiziXiao in #1422
- [Fix] Sub summarizer order fix by @bittersweet1999 in #1426
- [Update] Support auto-download of FOFO/MT-Bench-101 by @tonysy in #1423
- [Bug] Commonsenseqa dataset fix by @MaiziXiao in #1425
- [Feature] Add abbr for rolebench dataset by @xu-song in #1431
- [Feature] Add Ruler datasets by @MaiziXiao in #1310
- [Fix] Fix openai api tiktoken bug for api server by @liushz in #1433
- Update openai_api.py by @seetimee in #1438
- [Feature] Add model support for 'huggingface_above_v4_33' when using '-a' by @liushz in #1430
- Add scicode by @HariSeldon0 in #1417
- [Doc] Update Readme by @MaiziXiao in #1439
- [Fix] Update option postprocess & mathbench language summarizer by @liushz in #1413
- [ci] add commond testcase into daily testcase by @zhulinJulia24 in #1447
- [Feature] Switch to python runner for single GPU by @xu-song in #1308
- [Fix] Update SciCode and Gemma model by @tonysy in #1449
- [Bump] Bump version to 0.3.1 by @tonysy in #1450
Full Changelog: 0.3.0...0.3.1
Thank you for your continued support and contributions to OpenCompass!