🤘 awesome-semantic-segmentation
☁️ 🚀 📊 📈 Evaluating state of the art in AI
Python package for the evaluation of odometry and SLAM
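This description matches the evo package; assuming that, here is a hedged sketch of its Python API for absolute pose error (APE) on KITTI-format pose files. The file names are placeholders, and evo also ships CLI tools for the same job.

```python
# Hedged sketch: computing absolute pose error (APE) with evo's Python API.
# File names below are placeholders.
from evo.core import metrics
from evo.tools import file_interface

# KITTI-format pose files (one 3x4 pose matrix per line, no timestamps)
traj_ref = file_interface.read_kitti_poses_file("ground_truth.txt")
traj_est = file_interface.read_kitti_poses_file("estimate.txt")

# APE on the translation part, summarized as RMSE
ape = metrics.APE(metrics.PoseRelation.translation_part)
ape.process_data((traj_ref, traj_est))
print("APE RMSE:", ape.get_statistic(metrics.StatisticsType.rmse))
```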
End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
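This description matches Langfuse; assuming that, a minimal hedged sketch of tracing a function with its Python SDK's decorator API (module path per the v2 SDK; the traced function is a stand-in, and credentials are read from environment variables):

```python
# Hedged sketch: tracing a call with Langfuse's @observe decorator (v2 SDK).
# Expects LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY in the environment;
# the function body is a stand-in for a real LLM call.
from langfuse.decorators import observe

@observe()  # records this call as a trace in Langfuse
def answer(question: str) -> str:
    return "stub answer"  # replace with a call to your LLM of choice

answer("What is observability?")
```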
(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
TCExam is a CBA (Computer-Based Assessment) system (e-exam, CBT / Computer-Based Testing) for universities, schools, and companies that enables educators and trainers to author, schedule, deliver, and report on surveys, quizzes, tests, and exams.
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Avalanche: an End-to-End Library for Continual Learning based on PyTorch.
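As an illustration, a hedged sketch of a continual-learning loop in Avalanche; module paths follow recent releases (older versions use `avalanche.training.strategies`), and the benchmark, model, and hyperparameters are illustrative.

```python
# Hedged sketch of a continual-learning loop in Avalanche.
import torch
from avalanche.benchmarks.classic import SplitMNIST
from avalanche.models import SimpleMLP
from avalanche.training.supervised import Naive

benchmark = SplitMNIST(n_experiences=5)  # MNIST split into 5 tasks
model = SimpleMLP(num_classes=10)
strategy = Naive(
    model,
    torch.optim.SGD(model.parameters(), lr=1e-3),
    torch.nn.CrossEntropyLoss(),
    train_mb_size=32,
    train_epochs=1,
)

for experience in benchmark.train_stream:  # train on one experience at a time
    strategy.train(experience)
    strategy.eval(benchmark.test_stream)   # evaluate on the full test stream
```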
Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
FuzzBench - Fuzzer benchmarking as a service.
Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
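A minimal sketch of the 🤗 Evaluate API: "accuracy" is one of many metrics loadable from the Hugging Face Hub, and the label values below are illustrative.

```python
# Load a metric from the Hub and compute it on toy predictions.
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(
    predictions=[0, 1, 1, 0],
    references=[0, 1, 0, 0],
)
print(result)  # {'accuracy': 0.75}
```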
Evaluation code for various unsupervised automated metrics for Natural Language Generation.
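This description appears to be the nlg-eval package; assuming that, a hedged sketch of its scorer API (the flags shown disable the heavier embedding-based metrics, and the strings are illustrative):

```python
# Hedged sketch of the nlg-eval scorer API (requires a one-time
# `nlg-eval --setup` to download metric dependencies).
from nlgeval import NLGEval

scorer = NLGEval(no_skipthoughts=True, no_glove=True)  # word-overlap metrics only
scores = scorer.compute_individual_metrics(
    ref=["the cat sat on the mat"],  # one or more reference strings
    hyp="a cat is sitting on the mat",
)
print(scores)  # e.g. Bleu_1..4, METEOR, ROUGE_L, CIDEr
```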
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Recommender system library for the CLR (.NET)
SemanticKITTI API for visualizing dataset, processing data, and evaluating results.
LLMOps with Prompt Flow is an "LLMOps template and guidance" project that helps you build LLM-infused apps using Prompt Flow. It offers a range of features, including Centralized Code Hosting, Lifecycle Management, Variant and Hyperparameter Experimentation, A/B Deployment, and reporting for all runs and experiments.
A unified evaluation framework for large language models