evaluation
Here are 1,147 public repositories matching this topic...
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Updated Jul 9, 2024 - TypeScript
🤖 Build AI applications with confidence ✅ DSPy Visualizer ✅ Understand how your users are using your LLM app ✅ Get a full picture of the quality and performance of your LLM app ✅ Collaborate with your stakeholders in ONE platform ✅ Iterate towards the most valuable & reliable LLM app.
Updated Jul 9, 2024 - TypeScript
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
Updated Jul 9, 2024 - Go
Valor is a centralized evaluation store which makes it easy to measure, explore, and rank model performance.
Updated Jul 9, 2024 - Python
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally, alongside the recently released data processing library datatrove and the LLM training library nanotron.
Updated Jul 9, 2024 - Python
Chess engine
Updated Jul 9, 2024 - C++
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting GPT-4V, Gemini, QwenVLPlus, 50+ Hugging Face models, and 20+ benchmarks
Updated Jul 9, 2024 - Python
Pip-installable CodeBLEU metric implementation available for Linux/macOS/Windows
Updated Jul 9, 2024 - Python
Machine Learning Experiment Management Platform
Updated Jul 9, 2024 - Python
Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
Updated Jul 9, 2024 - TypeScript
R Package for preprocessing, normalizing, and analyzing proteomics data
Updated Jul 9, 2024 - R
A streamlined and customizable framework for efficient large model evaluation and performance benchmarking
Updated Jul 9, 2024 - Python
Official implementation of the ACL 2024 paper "Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs" (https://arxiv.org/abs/2402.11199).
Updated Jul 9, 2024 - Python
A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
Updated Jul 9, 2024
A task generation and model evaluation system.
Updated Jul 9, 2024 - Python
Moodle plugin for running evaluations within Moodle; this is the evaluation activity plugin.
Updated Jul 9, 2024 - PHP
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of LLMs. It focuses on foundation-model evaluation and on probing the technical boundaries of generative AI.
Updated Jul 9, 2024