Hongwei Niu¹, Jie Hu², Jianghang Lin¹, Guannan Jiang³, Shengchuan Zhang¹
¹Xiamen University, ²National University of Singapore, ³Contemporary Amperex Technology Co., Limited (CATL)
Open-vocabulary panoptic segmentation aims to segment and classify everything in diverse scenes across an unbounded vocabulary. Existing methods typically adopt either a two-stage or a single-stage framework. The two-stage framework crops the image multiple times using masks produced by a mask generator and then extracts features from each crop, while the single-stage framework relies on a heavyweight mask decoder that compensates for the lack of spatial position information through self-attention and cross-attention in multiple stacked Transformer blocks. Both incur substantial computational overhead, hindering inference efficiency. To fill this efficiency gap, we propose EOV-Seg, a novel single-stage, shared, efficient, and spatial-aware framework for open-vocabulary panoptic segmentation. EOV-Seg innovates in two respects. First, a Vocabulary-Aware Selection (VAS) module improves the semantic comprehension of aggregated visual features and alleviates the feature-interaction burden on the mask decoder. Second, a Two-way Dynamic Embedding Experts (TDEE) module efficiently exploits the spatial-awareness capabilities of a ViT-based CLIP backbone. To the best of our knowledge, EOV-Seg is the first open-vocabulary panoptic segmentation framework designed for efficiency: it runs faster than state-of-the-art methods while achieving competitive performance. With COCO training only, EOV-Seg reaches 24.5 PQ, 32.1 mIoU, and 11.6 FPS on ADE20K, and its inference is 4 to 19 times faster than state-of-the-art methods. Notably, with a ResNet50 backbone, EOV-Seg runs at 23.8 FPS with only 71M parameters on a single RTX 3090 GPU.
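The Vocabulary-Aware Selection idea can be pictured as conditioning the aggregated visual features on the CLIP text embeddings of the vocabulary before they reach the mask decoder. The PyTorch sketch below is purely illustrative, assuming a simple channel-gating fusion; the module name, shapes, and fusion scheme are our assumptions, not the released implementation.

import torch
import torch.nn as nn

class VocabularyAwareSelection(nn.Module):
    # Illustrative sketch only: gates visual feature channels with pooled
    # CLIP text (vocabulary) embeddings. Not the released EOV-Seg module.
    def __init__(self, vis_dim, txt_dim):
        super().__init__()
        self.proj = nn.Linear(txt_dim, vis_dim)  # text space -> visual channel space

    def forward(self, vis_feat, txt_emb):
        # vis_feat: (B, C, H, W) aggregated visual features
        # txt_emb:  (K, D) CLIP embeddings of the K vocabulary prompts
        gate = torch.sigmoid(self.proj(txt_emb.mean(dim=0)))  # (C,) channel gate
        return vis_feat * gate.view(1, -1, 1, 1)

# Toy usage with random tensors standing in for real features and prompts
vas = VocabularyAwareSelection(vis_dim=256, txt_dim=512)
out = vas(torch.randn(2, 256, 64, 64), torch.randn(80, 512))
print(out.shape)  # torch.Size([2, 256, 64, 64])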
# Create and activate a conda environment
conda create --name eov-seg python=3.8 -y
conda activate eov-seg

# Install PyTorch and torchvision with CUDA 11.7 support
pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -U opencv-python

# Install detectron2 from source
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2

# Install the COCO panoptic and Cityscapes evaluation APIs
pip install git+https://github.com/cocodataset/panopticapi.git
pip install git+https://github.com/mcordts/cityscapesScripts.git

# Clone EOV-Seg and install the remaining dependencies
git clone https://github.com/nhw649/EOV-Seg.git
cd EOV-Seg
pip install -r requirements.txt
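After installation, a quick sanity check (optional, not part of the official instructions) confirms that PyTorch sees the GPU and detectron2 imports cleanly:

# Optional: verify the install (expect "1.13.0+cu117 True" and a detectron2 version)
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import detectron2; print(detectron2.__version__)"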
Open-vocabulary panoptic segmentation on ADE20K (trained on COCO only):

| Name | Backbone | PQ | SQ | RQ | AP | mIoU | FPS | Params | Download |
|---|---|---|---|---|---|---|---|---|---|
| EOV-Seg (S) | ResNet50 | 15.1 | 57.0 | 18.9 | 7.2 | 21.9 | 23.8 | 71M | ckpt |
| EOV-Seg (M) | ResNet50x4 | 18.7 | 63.5 | 23.2 | 8.5 | 25.5 | 18.4 | 127M | ckpt |
| EOV-Seg (L) | ConvNeXt-L | 24.5 | 70.2 | 30.1 | 13.7 | 32.1 | 11.6 | 225M | ckpt |
Open-vocabulary semantic segmentation across benchmarks (the A-847, PC-459, A-150, PC-59, and PAS-20 columns report mIoU):

| Name | Backbone | A-847 | PC-459 | A-150 | PC-59 | PAS-20 | FPS | Download |
|---|---|---|---|---|---|---|---|---|
| EOV-Seg (S) | ResNet50 | 6.6 | 11.5 | 21.9 | 46.0 | 87.2 | 24.5 | ckpt |
| EOV-Seg (M) | ResNet50x4 | 7.8 | 12.2 | 25.5 | 51.8 | 91.2 | 18.9 | ckpt |
| EOV-Seg (L) | ConvNeXt-L | 12.8 | 16.8 | 32.1 | 56.9 | 94.8 | 11.8 | ckpt |
- Please follow this to prepare the datasets for training. The data should be organized as follows (a quick check command is given after the tree):
datasets/
coco/
annotations/
{train, val}2017/
panoptic_{train, val}2017/
panoptic_semseg_{train, val}2017/
stuffthingmaps_detectron2/
ADEChallengeData2016/
images/
annotations/
annotations_instance/
annotations_detectron2/
ade20k_panoptic_{train, val}/
ade20k_panoptic_{train,val}.json
ade20k_instance_{train,val}.json
ADE20K_2021_17_01/
images/
images_detectron2/
annotations_detectron2/
VOCdevkit/
VOC2012/
Annotations/
JPEGImages/
ImageSets/
Segmentation/
VOC2010/
JPEGImages/
trainval/
trainval_merged.json
pascal_voc_d2/
images/
annotations_pascal21/
annotations_pascal20/
pascal_ctx_d2/
images/
annotations_ctx59/
annotations_ctx459/
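To confirm that your local layout matches the tree above, a simple directory listing helps (an optional helper, not from the official instructions):

# Optional: list the top two levels of datasets/ and compare against the tree above
find datasets/ -maxdepth 2 -type d | sort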
# For ConvNeXt-Large variant
python train_net.py --num-gpus 4 --config-file configs/eov_seg/eov_seg_convnext_l.yaml
# For ResNet-50x4 variant
python train_net.py --num-gpus 4 --config-file configs/eov_seg/eov_seg_r50x4.yaml
# For ResNet-50 variant
python train_net.py --num-gpus 4 --config-file configs/eov_seg/eov_seg_r50.yaml
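Assuming train_net.py uses detectron2's default argument parser (which the flags above suggest), interrupted runs can be resumed and config values overridden from the command line:

# Resume training from the last checkpoint in the config's OUTPUT_DIR
python train_net.py --num-gpus 4 --config-file configs/eov_seg/eov_seg_r50.yaml --resume

# Override config values in-line, e.g. a smaller batch size for fewer GPUs
python train_net.py --num-gpus 2 --config-file configs/eov_seg/eov_seg_r50.yaml SOLVER.IMS_PER_BATCH 8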
# For ConvNeXt-Large variant
python train_net.py --config-file configs/eov_seg/eov_seg_convnext_l.yaml --eval-only MODEL.WEIGHTS /path/to/checkpoint_file
# For ResNet-50x4 variant
python train_net.py --config-file configs/eov_seg/eov_seg_r50x4.yaml --eval-only MODEL.WEIGHTS /path/to/checkpoint_file
# For ResNet-50 variant
python train_net.py --config-file configs/eov_seg/eov_seg_r50.yaml --eval-only MODEL.WEIGHTS /path/to/checkpoint_file
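To evaluate on a benchmark other than the config's default, detectron2 allows overriding DATASETS.TEST on the command line; the dataset name below is illustrative and must match one actually registered by this repo:

# Illustrative: evaluate on ADE20K panoptic val ("ade20k_panoptic_val" is an assumed name)
python train_net.py --config-file configs/eov_seg/eov_seg_r50.yaml --eval-only \
    MODEL.WEIGHTS /path/to/checkpoint_file DATASETS.TEST '("ade20k_panoptic_val",)'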
python demo/demo.py --config-file configs/eov_seg/eov_seg_r50.yaml \
    --input input_dir/ \
    --output output_dir/ \
    --opts MODEL.WEIGHTS /path/to/checkpoint_file
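detectron2-style demo scripts typically accept individual files or glob patterns for --input as well as a directory; assuming EOV-Seg's demo follows that convention, a single image can be processed like this:

# Assumed usage following detectron2's demo conventions
python demo/demo.py --config-file configs/eov_seg/eov_seg_r50.yaml \
    --input path/to/image.jpg \
    --output output_dir/ \
    --opts MODEL.WEIGHTS /path/to/checkpoint_file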
@article{niu2024eov,
title={EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation},
author={Niu, Hongwei and Hu, Jie and Lin, Jianghang and Zhang, Shengchuan},
journal={arXiv preprint arXiv:2412.08628},
year={2024}
}
EOV-Seg is released under the Apache 2.0 license. Please check the LICENSE file carefully if you intend to use our code for commercial purposes.