Improve Documentation Clarity and Consistency in README #23

## 📖 Introduction
OmAgent is a multimodal intelligent agent system dedicated to harnessing multimodal large language models and other multimodal algorithms to accomplish intriguing tasks. The project includes a lightweight intelligent agent framework, omagent_core, designed specifically for multimodal challenges. With this framework, we have built a long-form video comprehension system, OmAgent, and you are of course free to use it to realize your own innovative ideas.
OmAgent comprises three core components:
- **Video2RAG**: The concept behind this component is to transform the comprehension of long videos into a multimodal RAG task. The advantage of this approach is that it overcomes the limitations imposed by video length; however, the downside is that such preprocessing may result in the loss of significant video detail.
- **DnCLoop**: Inspired by the classical algorithmic paradigm of Divide and Conquer, we devised a recursive general-task processing logic. This method iteratively refines complex problems into a task tree, ultimately transforming intricate tasks into a series of solvable, simpler tasks (see the sketch after this list).
- **Rewinder Tool**: To address the issue of information loss in the Video2RAG process, we have designed a "progress bar" tool named Rewinder that can be autonomously used by agents. This enables the agents to revisit any video details, allowing them to seek out the necessary information.
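
To make the DnCLoop idea concrete, here is a minimal sketch of the Divide-and-Conquer recursion it describes. This is not the omagent_core API: every name below (`Task`, `is_solvable`, `divide`, `dnc_loop`) is a hypothetical stand-in for decisions the framework delegates to its models.

```python
# Hypothetical sketch of the DnCLoop paradigm, not the omagent_core API:
# recursively refine a task into a tree of simpler subtasks, solve the
# leaves, and combine partial answers on the way back up.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    description: str
    subtasks: List["Task"] = field(default_factory=list)


def is_solvable(task: Task) -> bool:
    # Stand-in: in OmAgent, a model judges whether the task is simple
    # enough to answer directly.
    return len(task.description.split()) <= 5


def divide(task: Task) -> List[Task]:
    # Stand-in: in OmAgent, a model proposes the subtasks.
    words = task.description.split()
    mid = len(words) // 2
    return [Task(" ".join(words[:mid])), Task(" ".join(words[mid:]))]


def dnc_loop(task: Task) -> str:
    if is_solvable(task):
        return f"<answer to: {task.description}>"
    task.subtasks = divide(task)  # grow the task tree one level
    # Conquer: merge subtask answers into an answer for the parent task.
    return " | ".join(dnc_loop(sub) for sub in task.subtasks)
```

In the framework itself, these placeholder decisions are made by the agent's models as the task tree is iteratively refined.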

For more details, check out our paper **OmAgent: A Multi-modal Agent Framework**.
## 🚀 Quick Start

### General Task Processing
1. Create a configuration file and set the necessary variables.
```shell
cd workflows/general && vim config.yaml
```
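The exact fields are defined by the template config.yaml in the repository; the snippet below is only a rough illustration, and its key names are assumptions rather than the documented schema.
```yaml
# Illustrative values only; consult workflows/general/config.yaml in the
# repository for the real field names.
custom_openai_endpoint: https://api.openai.com/v1  # assumed LLM endpoint field
custom_openai_key: sk-xxxx                         # assumed API key field
```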
- Download some interesting videos

#### Running Preparation
1. Create a configuration file and set the required environment variables.
```shell
cd workflows/video_understanding && vim config.yaml
```
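As with the general workflow, the template config.yaml is the authoritative list of settings; purely as a hypothetical illustration, the environment might be prepared along these lines (the variable names are assumptions, not names OmAgent is documented to read):
```shell
# Hypothetical environment variables; the template config.yaml and the
# repository docs list the real names.
export OPENAI_API_KEY="sk-xxxx"
export OPENAI_BASE_URL="https://api.openai.com/v1"
```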
```python
if __name__ == "__main__":
    run_agent("")  # You will be prompted to enter the query in the console
```
3. Start OmAgent by running ```python run.py```. Provide the path of the video you wish to process, wait briefly, then enter your query, and OmAgent will respond accordingly, as sketched below.
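
Purely as a hypothetical illustration of that interaction (the video path and query below are made up):
```shell
python run.py
# Example session (hypothetical):
#   Video path: ./videos/demo.mp4
#   Query: What happens in the first five minutes of the video?
```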

## 🔗 Related works
If you are interested in multimodal algorithms, large language models, and agent technologies, we encourage you to explore our research endeavors:
🔆 [How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection](https://arxiv.org/abs/2308.13177) (AAAI24)
🏠 [GitHub Repository](https://github.com/om-ai-lab/OVDEval/tree/main)
