
Stencila

Stencila Evaluations and Benchmarking

👋 Intro 🚴 Roadmap 🛠️ Develop 🙏 Acknowledgements 💖 Supporters


👋 Introduction

Welcome to the repository for Stencila's LLM evaluations and benchmarking. This is in early development and consolidates related code we have had in other repos.

🚴 Roadmap

We plan three main methodologies for evaluating LLMs on science-focussed prompts and tasks. To avoid discontinuities, we are likely to use a weighting approach, in which we gradually increase the weight of the more advanced methodologies as they are developed.
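As a rough illustration of the weighting approach, the sketch below blends per-methodology scores into one overall score. The function name `blend_scores`, the methodology labels, and the weights are all hypothetical and are not taken from this repository; the only idea carried over from the roadmap is that weights can shift toward the more advanced methodologies over time.

```python
def blend_scores(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-methodology scores.

    Only methodologies that have both a score and a positive weight
    contribute, so a methodology that is not yet developed can simply
    be given a weight of zero (or be left out of `scores`).
    """
    active = [name for name in scores if weights.get(name, 0.0) > 0.0]
    total_weight = sum(weights[name] for name in active)
    if total_weight == 0.0:
        raise ValueError("no methodology has a positive weight")
    return sum(scores[name] * weights[name] for name in active) / total_weight


# Hypothetical early-stage weighting: mostly external benchmarks,
# with a small weight on the jury-based score.
early = blend_scores(
    {"external": 0.8, "jury": 0.6},
    {"external": 0.75, "jury": 0.25, "user": 0.0},
)
```

Raising the `jury` and `user` weights in small steps, rather than switching methodologies outright, is one way to keep the overall score from jumping between releases.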

Using external benchmarks

Collate external benchmarks and map prompts to each. For example, combine scores from LiveBench's coding benchmark and Aider's code editing benchmark into a single code-quality score and use it for stencila/create/code-chunk, stencila/create/figure-code and other code-related prompts.
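A minimal sketch of that mapping, assuming both benchmarks report scores on the same 0 to 100 scale and are weighted equally. The benchmark keys, the `PROMPT_BENCHMARKS` table, the `code_quality` function, and the example scores are hypothetical placeholders, not real results or this repository's actual data model.

```python
from statistics import mean
from typing import Optional

# Hypothetical mapping from prompt id to the external benchmarks relevant to it.
PROMPT_BENCHMARKS = {
    "stencila/create/code-chunk": ["livebench-coding", "aider-code-editing"],
    "stencila/create/figure-code": ["livebench-coding", "aider-code-editing"],
}


def code_quality(benchmark_scores: dict[str, float], prompt: str) -> Optional[float]:
    """Average the benchmark scores mapped to `prompt` for one model.

    Returns None when the prompt has no mapping or the model has no
    score on any of the mapped benchmarks.
    """
    names = PROMPT_BENCHMARKS.get(prompt, [])
    available = [benchmark_scores[n] for n in names if n in benchmark_scores]
    return mean(available) if available else None


# Made-up scores for a single model, for illustration only.
model_scores = {"livebench-coding": 60.0, "aider-code-editing": 80.0}
quality = code_quality(model_scores, "stencila/create/code-chunk")
```

If the benchmarks use different scales, they would need to be normalized before averaging; that step is left out here.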

Using LLM-as-a-jury and related methods

Establish a pipeline for evaluating prompts themselves, and which LLMs are best suited to each prompt, using LLM-as-a-jury and other methods for machine-based evaluation.
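One way to picture the jury step: several judge models each score a response, and the scores are averaged. In the sketch below the judges are plain Python callables standing in for LLM calls. The `Judge` type, the `jury_score` function, and the stub judges are assumptions made for illustration, not the pipeline this repository implements.

```python
from statistics import mean
from typing import Callable

# A judge takes (prompt, response) and returns a score in [0, 1].
# In a real pipeline each judge would wrap a call to a different LLM.
Judge = Callable[[str, str], float]


def jury_score(prompt: str, response: str, judges: list[Judge]) -> float:
    """Mean of the scores given by each judge to one response."""
    if not judges:
        raise ValueError("at least one judge is required")
    return mean(judge(prompt, response) for judge in judges)


# Stub judges that return fixed scores, for illustration only.
stub_judges: list[Judge] = [
    lambda prompt, response: 1.0,
    lambda prompt, response: 0.5,
    lambda prompt, response: 0.0,
]
score = jury_score("Plot y against x", "plot(x, y)", stub_judges)
```

Using several judges rather than one is meant to dilute the bias of any single model; other aggregations, such as a median or a majority vote, would fit the same interface.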

Using user acceptance and refinement data

Use data from users' acceptance and refinement of AI suggestions within documents as the basis for human-based evaluations.
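A sketch of how such data might be turned into a score, assuming each suggestion is logged with one of three outcomes. The outcome labels, the partial credit given to refined suggestions, and the `acceptance_score` function are all hypothetical choices, not a description of how Stencila records or scores this data.

```python
# Hypothetical credit per outcome: a refined suggestion counts as
# partially useful, a rejected one as not useful at all.
OUTCOME_CREDIT = {"accepted": 1.0, "refined": 0.5, "rejected": 0.0}


def acceptance_score(outcomes: list[str]) -> float:
    """Average credit over a list of logged suggestion outcomes."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(OUTCOME_CREDIT[outcome] for outcome in outcomes) / len(outcomes)


# Made-up log of four suggestions from one model on one prompt.
example = acceptance_score(["accepted", "refined", "rejected", "accepted"])
```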

🛠️ Development

For development, you’ll need to install uv and just, which the commands below rely on.

Then, the following will get you started with a development environment:

just init

Once uv is installed, you can use it to install some additional tools:

uv tool install ruff
uv tool install pyright

The justfile has some common development-related commands that you might want to run. For example, the check command runs all linting and tests:

just check

To run anything within the virtual environment, you need to use uv run <command>. Alternatively, you can install direnv, and have the virtual environment activated automatically. See here for more details about using direnv and uv together.

🙏 Acknowledgements

Thank you to the following projects whose code and/or data we rely on:

💖 Supporters

We are grateful for the support of the Astera Institute for this work.