Releases · svilupp/Julia-LLM-Leaderboard
v0.2.0
Added
- Added new models (OpenAI "0125" versions, Codellama, and more)
- Capability to evaluate code with the AgentCodeFixer loop (set `codefixing_num_rounds > 0`); see the sketch after this list
- Automatically set a different seed for commercial API providers (MistralAI, OpenAI) to avoid their caching mechanism
- Re-scored all past submissions with the new methodology
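For illustration, a minimal sketch of the new options. The exact function accepting `codefixing_num_rounds` is an assumption (only the keyword comes from these notes), and the seed trick assumes PromptingTools' `api_kwargs` passthrough:

```julia
using PromptingTools

# Hypothetical call: the function name and keyword placement are assumptions.
# Setting codefixing_num_rounds above 0 lets the AgentCodeFixer loop attempt
# that many repair rounds on failing code:
# results = evaluate(conversation; codefixing_num_rounds = 3, verbose = true)

# Avoiding provider-side caching: send a fresh random seed with each request.
# OpenAI (and MistralAI) accept a `seed` parameter, forwarded via `api_kwargs`.
msg = aigenerate("Write a `sum_of_squares(v)` function in Julia";
                 api_kwargs = (; seed = rand(1:10^6)))
```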
Fixed
- Improved code loading and debugging via Julia's code loading mechanism (`include_string`), which makes it easier to locate the lines that caused the errors (run `evaluate(...; verbose=true)` to see which lines caused the errors, or set `return_debug=true` to return the debug information as a secondary output); see the sketch after this list
- Improved error capture and scoring (eg, imports of Base modules are now correctly recognized as "safe")
- Improved detection of parse errors (ie, reduces the score of submissions that "executed" only because the parsing error was not detected earlier)
- Fixed `mkdir` bug in `run_benchmark`
Removed
- `@timeout` macro has been upstreamed to PromptingTools
Case Studies
- Quantization effects on Yi34b and Magicoder 7b
- Effect of English vs Chinese on performance with Yi34b