OmChat: A Family of Powerful Native Multimodal Language Models

We are thrilled to announce the release of OmChat Beta 2.0, a research version of our models from Om AI. This release combines the Qwen2 7B LLM base with the InternViT-6B vision tower to form the OmChat Beta 13B model. The models are open source and available to researchers in the multimodal field, with the aim of supporting meaningful research and contributing to the progress of the AI ecosystem.

In the near future, we plan to release OmChat Beta 2.1, which will include support for long context as detailed in the OmChat paper, as well as a lighter version of the model. We will continue to update our latest versions for research purposes. For performance evaluation, we have tested our models using the OpenCompass benchmarks.

Updates

09/10/2024: OmChat2.1-8B ranks first among 8B models on the multi-image benchmark Mantis-Eval, where it also outperforms GPT-4V. It also achieves SOTA on the video benchmark MMBench-Video. 🎉
08/10/2024: The OmChat open-source project has been unveiled. 🎉
07/06/2024: The OmChat research paper has been published.

Models & Scripts

Installation

1. Clone this repository and navigate to the OmChat folder:

git clone https://github.com/om-ai-lab/OmChat.git
cd OmChat

2. Install the inference package:

conda create -n omchat python=3.10 -y
conda activate omchat
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e .
pip install flash-attn
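
After installation, a quick sanity check can confirm that the environment looks usable before loading a 13B model. The sketch below is not part of the OmChat repository: the function name `check_environment` and the package list are assumptions chosen for illustration (`flash_attn` is the import name of the `flash-attn` package; `torch` and `transformers` are the libraries used in the Hugging Face example later in this README).

```python
import importlib.util
import sys


def check_environment(required_python=(3, 10),
                      packages=("torch", "transformers", "flash_attn")):
    """Report whether the interpreter and key packages look ready for inference.

    Hypothetical helper: the package list is an assumption, not an official
    requirement list from the OmChat repository.
    """
    report = {
        # True when the interpreter is at least the version used in the conda command.
        "python_ok": sys.version_info[:2] >= required_python,
    }
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing'}")
```

The check only looks for the packages on the import path; it does not verify versions or GPU availability.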

How to test a single image

python single_inference.py --model-path path-to-omchat-model --image-path path-to-image --question question-content
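
To query several images in a row, the command above can be wrapped in a small driver script. This is a minimal sketch that uses only the three flags shown above; the helper names `build_command` and `run_batch` are hypothetical and do not exist in the repository. Each call starts a new Python process, which presumably reloads the model every time, so this is only reasonable for small batches.

```python
import shlex
import subprocess
import sys


def build_command(model_path, image_path, question, script="single_inference.py"):
    """Assemble the argument list for one single-image query."""
    return [
        sys.executable, script,
        "--model-path", model_path,
        "--image-path", image_path,
        "--question", question,
    ]


def run_batch(model_path, jobs, dry_run=True):
    """Run (or, with dry_run=True, only print) one command per (image_path, question) pair."""
    for image_path, question in jobs:
        cmd = build_command(model_path, image_path, question)
        if dry_run:
            # Print a shell-safe version of the command instead of executing it.
            print(shlex.join(cmd))
        else:
            # check=True raises CalledProcessError if the script exits with an error.
            subprocess.run(cmd, check=True)
```

Calling `run_batch("path-to-omchat-model", [("a.jpg", "What is shown?")])` prints the command it would run; pass `dry_run=False` from inside the OmChat folder to execute it.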

Command-Line Interface

Run conversational inference through the command-line interface:

python cli.py --model-path path-to-omchat-model --image-path path-to-image

Examples

User: how tall is he?
Assistant: The question is "how tall is he?". The question is asking about the height of the person in the picture. The answer is giving the height of the person in feet and inches. The person is 6 feet and 2 inches tall.
The answer is: 6'2"

An Example with Hugging Face transformers

Download the Hugging Face model:

git lfs install
git clone https://huggingface.co/omlab/omchat-v2.0-13B-single-beta_hf

import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "omlab/omchat-v2.0-13B-single-beta_hf"

# Load the model in float16 on the GPU, together with its matching processor.
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, torch_dtype=torch.float16).cuda().eval()
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Fetch a sample image and build the model inputs.
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "What's the content of the image?"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")

# Greedy decoding without gradient tracking.
with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=processor.tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt tokens.
outputs = processor.tokenizer.decode(output_ids[0, inputs.input_ids.shape[1]:]).strip()
print(outputs)
# The image features a stop sign in front of a Chinese archway, with a black car driving past. The stop sign is located on the left side of the scene, while the car is on the right side. There are also two statues of lions on either side of the archway, adding to the cultural ambiance of the scene.<|im_end|>
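
The sample output above ends with the end-of-turn marker `<|im_end|>`. If that marker is unwanted in downstream text, it can be removed after decoding. The helper below is a minimal sketch; the name `strip_end_token` is hypothetical and not part of the repository. An alternative that may also work is passing `skip_special_tokens=True` to the tokenizer's `decode` call, assuming the marker is registered as a special token in this model's tokenizer.

```python
def strip_end_token(text, end_token="<|im_end|>"):
    """Remove one trailing end-of-turn marker and any surrounding whitespace."""
    text = text.strip()
    if text.endswith(end_token):
        # Cut off exactly one marker from the end of the string.
        text = text[: -len(end_token)]
    return text.rstrip()


raw = "The image features a stop sign in front of a Chinese archway.<|im_end|>"
print(strip_end_token(raw))
```

Text without the marker is returned unchanged apart from whitespace trimming.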

Available HF Models from Om AI

omchat-v2.0-13B-single-beta_hf: currently supports only single images. Models with multi-image and video support will be released soon.

Model Comparison Results on OpenCompass (models under 20B parameters)

Rank	Method	Avg Score	MMBench_V11	MMStar	MMMU_VAL	MathVista
1	InternVL2-8B	64.1	79.4	61.5	51.2	58.3
2	OmChat2.0-13B	62.0	79.5	58.2	49.6	57.1
3	InternLM-XComposer2.5	61.1	79.4	59.9	42.9	63.7
4	InternVL2-4B	60.6	73.6	53.9	48.3	58.1
5	GLM-4v-9B	59.1	67.9	54.8	46.9	51.1
6	InternLM-XComposer2-4	58.8	76.5	55.3	39.7	59.4
7	MiniCPM-Llama3-V2.5	58.8	72	51.8	45.8	54.3
8	WeMM	58.3	75.7	57	45.3	54.9
9	InternLM-XComposer2	57.1	77.6	56.2	41.4	59.5
10	CogVLM2-19B-Chat	56.3	70.7	50.5	42.6	38.6

Model Comparison Results on Mantis-Eval (8B models)

Models	Model size	Mantis-Eval
OmChat-2.1-8B	8B	67.28
LLaVA OneVision	7B	64.20
GPT-4V	-	62.67
Mantis-SigLIP	8B	59.45
Mantis-Idefics2	8B	57.14
Mantis-CLIP	8B	55.76
VILA	8B	51.15
Idefics2	8B	48.85
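
The margins implied by the table can be checked with a few lines of Python. This is only an illustrative sketch: the scores are copied from the Mantis-Eval table above, and the helper name `margin_over` is hypothetical.

```python
# Scores copied from the Mantis-Eval table above.
scores = {
    "OmChat-2.1-8B": 67.28,
    "LLaVA OneVision": 64.20,
    "GPT-4V": 62.67,
    "Mantis-SigLIP": 59.45,
    "Mantis-Idefics2": 57.14,
    "Mantis-CLIP": 55.76,
    "VILA": 51.15,
    "Idefics2": 48.85,
}


def margin_over(model, baseline, table=scores):
    """Return the score difference between two models, rounded to two decimals."""
    return round(table[model] - table[baseline], 2)


# Sort model names from highest to lowest score.
ranking = sorted(scores, key=scores.get, reverse=True)
print("Top model:", ranking[0])
print("Margin over GPT-4V:", margin_over("OmChat-2.1-8B", "GPT-4V"))
print("Margin over LLaVA OneVision:", margin_over("OmChat-2.1-8B", "LLaVA OneVision"))
```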

Model Comparison Results on MMBench-Video

Columns CP through Perception Mean are perception scores; columns LR through Reasoning Mean are reasoning scores.
Model	Frame	Overall Mean	CP	FP-S	FP-C	HL	Perception Mean	LR	AR	RR	CSR	TR	Reasoning Mean
GPT-4o	8	1.62	1.82	1.59	1.43	1.95	1.63	1.33	1.89	1.60	1.60	1.44	1.57
GPT-4v	8	1.53	1.68	1.45	1.43	1.79	1.51	1.14	1.81	1.70	1.59	1.39	1.52
Gemini-Pro-v1.0	8	1.49	1.72	1.50	1.28	0.79	1.49	1.02	1.66	1.58	1.59	1.40	1.45
OmChat-2.1-8B	32	1.34	1.54	1.42	1.12	0.42	1.37	1.05	1.40	1.43	1.37	1.13	1.27
Gemini-Pro-v1.5	8	1.30	1.51	1.30	0.98	2.03	1.32	1.06	1.62	1.36	1.25	0.94	1.22
InternVL-Chat-v1.5	8	1.26	1.51	1.22	1.01	1.21	1.25	0.88	1.4	1.48	1.28	1.09	1.22
Claude-3v-Opus	4	1.19	1.37	1.11	1.00	1.56	1.16	1.12	1.35	1.36	1.17	1.05	1.20
mPLUG-Owl2	8	1.15	1.34	1.18	0.99	0.27	1.15	0.63	1.33	1.30	1.03	1.11	1.11
VideoStreaming	64	1.12	1.38	1.13	0.8	0.32	1.13	0.77	1.27	1.11	1.01	1.1	1.09
idefics2-8B	8	1.10	1.23	1.07	0.89	0.77	1.06	0.77	1.27	1.41	1.11	1.14	1.16
Video-LLaVA	8	1.05	1.14	1.08	0.88	0.50	1.04	0.72	1.23	1.03	0.89	0.97	0.99

Citation

If you find our repository beneficial, please cite our paper:

@article{zhao2024omchat,
  title={OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding},
  author={Zhao, Tiancheng and Zhang, Qianqian and Lee, Kyusong and Liu, Peng and Zhang, Lu and Fang, Chunxin and Liao, Jiajia and Jiang, Kelei and Ma, Yibo and Xu, Ruochen},
  journal={arXiv preprint arXiv:2407.04923},
  year={2024}
}

Acknowledgement

The codebase and models are built upon the following projects:

Projects from Om AI Team

If you are intrigued by multimodal algorithms, large language models, and agent technologies, we invite you to delve deeper into our research endeavors:
🔆 OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer
🏠 Github Repository

🔆 How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection(AAAI24)
🏠 Github Repository

🔆 OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network(IET Computer Vision)
🏠 Github Repository
