The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.9!
Highlights
This version introduces a number of new features and improvements that enhance the user experience and expand the capabilities of OpenCompass. Notable changes include support for G-Pass@k and LiveMathBench, as well as the introduction of the Bradley-Terry subjective evaluation method.
New Features
- Added support for the G-Pass@k metric and the LiveMathBench benchmark to better evaluate model performance. (#1772)
- Added a TheoremQA zero-shot CoT configuration for more comprehensive evaluation scenarios. (#1783)
- Added a customizable tokenizer for RULER, offering greater flexibility in processing inputs. (#1731)
- Added the LiveStemBench dataset to enrich our collection of datasets. (#1794)
- Integrated JudgeLLM into o1 evaluation for improved assessment accuracy. (#1795)
- Implemented the Bradley-Terry subjective evaluation method for the wildbench, alpacaeval, and compassarena datasets. (#1791)
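For readers unfamiliar with the Bradley-Terry model, here is a minimal, illustrative sketch of how pairwise preference counts can be turned into per-model strength scores. It is not the OpenCompass implementation: the function name `fit_bradley_terry`, the `wins` matrix layout, and the choice of the classic minorization-maximization (MM) update are all assumptions made for this example.

```python
def fit_bradley_terry(wins, n_iter=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    Hypothetical helper for illustration only.
    wins[i][j] is the number of comparisons in which model i beat model j.
    Returns a list of positive strengths normalized to sum to len(wins).
    """
    n = len(wins)
    p = [1.0] * n  # start from equal strengths
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # MM update: games played against j, weighted by 1 / (p_i + p_j)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        # Strengths are only defined up to scale, so fix the sum to n
        scale = n / sum(new_p)
        new_p = [x * scale for x in new_p]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


# Example: three hypothetical models, ten judged comparisons per pair
strengths = fit_bradley_terry([[0, 7, 9], [3, 0, 6], [1, 4, 0]])
print(strengths)
```

Under this model, the estimated probability that model `i` is preferred over model `j` is `strengths[i] / (strengths[i] + strengths[j])`, so the scores can be used to rank models from pairwise judge verdicts.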
Documentation
- Updated OC academic content to reflect the latest information as of December 2024. (#1771)
Bug Fixes
- Fixed an order error that was causing issues with sequence handling. (#1767)
- Resolved an issue where the lark report was returning None. (#1769)
- Corrected the path for saving Local Runner parameters. (#1768)
- Amended the summarizer abbreviation for models to ensure proper identification. (#1789)
- Fixed output_path errors to improve file handling reliability. (#1798)
Enhancements and Refactors
- Integrated the Fullbench testcase into the CI pipeline. (#1766)
- Updated Volc status exception handling for more robust responses. (#1780)
- Removed the daily step retry mechanism and updated the PR score calculation for efficiency. (#1782)
- Updated the Python version used for deployment to the latest stable release. (#1784)
- Refined the PyPI deploy workflow for smoother deployments. (#1786)
Thank you for being part of the OpenCompass community! Your support and contributions make each release possible.