Stencila Evaluations and Benchmarking
👋 Intro • 🚴 Roadmap • 🛠️ Develop • 🙏 Acknowledgements • 💖 Supporters
Welcome to the repository for Stencila's LLM evaluations and benchmarking. This is in early development and consolidates related code that we previously had in other repos.
We plan to use the following three main methodologies for evaluating LLMs for science-focussed prompts and tasks. To avoid discontinuities, we are likely to use a weighting approach, gradually increasing the weight of the more advanced methodologies as they are developed.

1. Collate external benchmarks and map prompts to each. For example, combine scores from LiveBench's coding benchmark and Aider's code editing benchmark into a single code-quality score and use it for `stencila/create/code-chunk`, `stencila/create/figure-code`, and other code-related prompts (see the first sketch after this list).

2. Establish a pipeline for evaluating the prompts themselves, and which LLMs are best suited to each prompt, using LLM-as-a-jury and other methods for machine-based evaluation (see the second sketch after this list).

3. Use data from users' acceptance and refinement of AI suggestions within documents as the basis for human-based evaluations.
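To make the first methodology concrete, here is a minimal sketch of how external benchmark scores could be combined into a single per-prompt score. The benchmark identifiers, weights, prompt-to-benchmark mapping, and the `combined_score` helper are hypothetical placeholders for illustration, not the configuration or code used in this repository.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkScore:
    """A score (0-100) reported by an external benchmark for a given model."""

    benchmark: str
    model: str
    score: float


# Hypothetical mapping of prompts to the external benchmarks that inform them,
# with weights summing to 1.0 for each prompt.
PROMPT_BENCHMARK_WEIGHTS = {
    "stencila/create/code-chunk": {"livebench-coding": 0.5, "aider-code-editing": 0.5},
    "stencila/create/figure-code": {"livebench-coding": 0.6, "aider-code-editing": 0.4},
}


def combined_score(prompt: str, model: str, scores: list[BenchmarkScore]) -> float:
    """Combine external benchmark scores into a single score for a prompt/model pair."""
    weights = PROMPT_BENCHMARK_WEIGHTS[prompt]
    by_benchmark = {s.benchmark: s.score for s in scores if s.model == model}
    total = 0.0
    weight_used = 0.0
    for benchmark, weight in weights.items():
        if benchmark in by_benchmark:
            total += weight * by_benchmark[benchmark]
            weight_used += weight
    # Renormalize in case a benchmark has no score for this model.
    return total / weight_used if weight_used else 0.0


if __name__ == "__main__":
    scores = [
        BenchmarkScore("livebench-coding", "model-a", 62.0),
        BenchmarkScore("aider-code-editing", "model-a", 71.0),
    ]
    print(combined_score("stencila/create/code-chunk", "model-a", scores))  # 66.5
```

Renormalizing by the weight actually used keeps combined scores comparable when a benchmark has not reported results for a particular model.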
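And a minimal sketch of the LLM-as-a-jury idea from the second methodology, in which several "juror" models each rate a candidate response and the ratings are averaged. The `jury_score` function, the rubric wording, and the `ask_model` callable signature are assumptions for illustration; the real pipeline will use whatever client and rubric we settle on.

```python
from statistics import mean
from typing import Callable

# Hypothetical signature for a chat-completion call; replace with a real client.
AskModel = Callable[[str, str], str]  # (model, prompt) -> reply


def jury_score(task: str, response: str, jurors: list[str], ask_model: AskModel) -> float:
    """Average the 1-10 ratings that each juror model gives a candidate response."""
    rubric = (
        "Rate the following response to the task on a scale of 1 to 10 "
        "for scientific accuracy and usefulness. Reply with a single number.\n\n"
        f"Task:\n{task}\n\nResponse:\n{response}"
    )
    ratings = []
    for juror in jurors:
        reply = ask_model(juror, rubric)
        try:
            ratings.append(float(reply.strip()))
        except ValueError:
            continue  # Skip jurors that do not return a parseable number
    return mean(ratings) if ratings else 0.0
```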
For development, you’ll need to install the following dependencies:
Then, the following will get you started with a development environment:
```sh
just init
```
Once `uv` is installed, you can use it to install some additional tools:
```sh
uv tool install ruff
uv tool install pyright
```
The `justfile` has some common development-related commands that you might want to run.
For example, the `check` command runs all linting and tests:
```sh
just check
```
To run anything within the virtual environment, you need to use `uv run <command>`.
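For example, to run the test suite (assuming it uses pytest, which may not match this repository's actual tooling):

```sh
uv run pytest
```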
Alternatively, you can install `direnv` and have the virtual environment activated automatically.
See here for more details about using `direnv` and `uv` together.
Thank you to the following projects whose code and/or data we rely on:
We are grateful for the support of the Astera Institute for this work.