LLMEval-Survey

📌 What is This Survey About?

Large language models (LLMs) have attracted attention across many research domains because of their strong performance on a wide range of complex tasks. Refined evaluation methods are therefore needed to determine which tasks and responsibilities LLMs should take on. This survey discusses how LLMs, as useful tools, can be assessed effectively. We propose a two-stage framework, from core ability to agent, that explains how LLMs can be applied based on their specific capabilities and describes the evaluation methods for each stage. Core ability refers to the capabilities LLMs need in order to generate high-quality natural language text. Once LLMs are confirmed to possess core ability, they can act as agents to solve complex real-world tasks. In the core ability stage, we discuss the reasoning ability, societal impact, and domain knowledge of LLMs. In the agent stage, we cover embodied action, planning, and tool learning in LLM agent applications. Finally, we examine the challenges currently facing LLM evaluation methods and the directions for future development.

🤖 Framework

📖 Table of Contents

Introduction
- Artificial Intelligence and Large Language Model
- Why Evaluating LLMs is Important
- The Roadmap of Useful LLMs
- Study Overview
Core Ability Evaluation
- Reasoning
  - Logical Reasoning
  - Mathematical Reasoning
  - Commonsense Reasoning
  - Multi-hop Reasoning
  - Structured Data Reasoning
- Societal Impact
  - Safety
    - Content Safety
    - Security
    - Ethical Consideration
  - Truthfulness
    - Hallucination
    - Domain Knowledge
- Domain Knowledge
  - Finance
  - Legislation
  - Psychology
  - Medicine
  - Education
Agent Evaluation
- Planning
- Application Scenarios
  - Web Grounding
  - Code Generation
  - Database Queries
  - API Calls
  - Tool Creation
  - Robotic Navigation
  - Robotic Manipulation
- Benchmark
Future Directions
- Dynamic Evaluation
- LLMs as Evaluators
- Root Cause Analysis
- Fine-grained LLM Agent Evaluation
- Robot Benchmark Development
Conclusion

MiuLab/LLMEval-Survey