Winning model of DCASE Challenge 2023 Task 6A, with the follow-up publication:
- Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation
Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, and Shinji Watanabe
Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP) 2024
[arXiv page] [DCASE results] - BibTeX citation
```bibtex
@inproceedings{wu2024improving,
  title={Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation},
  author={Wu, Shih-Lun and Chang, Xuankai and Wichern, Gordon and Jung, Jee-weon and Germain, Fran{\c{c}}ois and Le Roux, Jonathan and Watanabe, Shinji},
  booktitle={Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2024}
}
```
- (Recommended) Create Conda environment with Python 3.9
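For example (the environment name `clotho-captioning` below is illustrative, not prescribed by the repo):

```shell
# create and activate a fresh Conda environment with Python 3.9
conda create -n clotho-captioning python=3.9
conda activate clotho-captioning
```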
- Install PyTorch with the correct CUDA version
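A hedged example for a CUDA 11.8 setup (the wheel index tag `cu118` is an assumption; pick the command matching your CUDA version from pytorch.org):

```shell
# example: install PyTorch built against CUDA 11.8
# (check `nvidia-smi` for the CUDA version your driver supports first)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
```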
- Install dependencies for SPICE metric
```shell
cd caption_evaluation_tools/coco_caption
bash get_stanford_models.sh
cd ../../
```
- Install other dependencies
```shell
pip install -r requirements.txt
```
- Install p7zip (required for unpacking the dataset)
```shell
# if using conda
conda install bioconda::p7zip

# if installing to system
# sudo apt-get install p7zip-full
```
- Download Clotho dataset
```shell
bash download_clotho.sh
```
- Install Git-LFS
```shell
# if using conda
conda install conda-forge::git-lfs
git-lfs install

# if installing to system
# curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
# sudo apt-get install git-lfs
# git-lfs install
```
- Get pretrained model (stored on HuggingFace)
```shell
bash download_model.sh
```
- Run inference & evaluation code
```shell
bash run_sampling_reranking.sh
```
- Metrics can then be found at:
  `./exp/inference_evaluation_nucleus_t0.5_p95/inference_metrics.json`
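To quickly inspect the reported scores, the metrics JSON can be pretty-printed with Python's standard library (assuming the run above has produced the file):

```shell
# pretty-print the evaluation metrics written by the inference run
python -m json.tool ./exp/inference_evaluation_nucleus_t0.5_p95/inference_metrics.json
```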
Our 50K mix-up caption augmentations generated by ChatGPT (see paper Section 2.3 for details) can be found at:
Our model/repository would not have been possible without the following great open-source works. Thank you so much!
- Clotho dataset: https://zenodo.org/records/4783391
- BEATs audio encoder: https://github.com/microsoft/unilm/tree/master/beats
- INSTRUCTOR LM embeddings: https://github.com/xlang-ai/instructor-embedding
- Evaluation tools
  - coco-caption: https://github.com/tylin/coco-caption
  - caption-evaluation-tools: https://github.com/audio-captioning/caption-evaluation-tools
  - fense: https://github.com/felixgontier/dcase-2023-baseline/tree/main/fense