Speechcatcher is an open-source toolbox for transcribing and translating speech from media files (audio or video). Speechcatcher models are trained with Whisper as the teacher model, yielding compact ASR models that also run fast on CPUs.
You can find the command-line interface here. It can transcribe any media file and can also be used for live transcription from your microphone. This repository also contains an overview of all available Speechcatcher models.
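As a rough illustration of scripting the command-line interface, the sketch below wraps it in a small Python helper. It assumes the CLI is installed as an executable named `speechcatcher` on your PATH and that it takes the media file path as its positional argument; check the CLI's own help output for the actual options. The function names `build_command` and `transcribe` are hypothetical and not part of Speechcatcher.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_command(media_file: str, executable: str = "speechcatcher") -> List[str]:
    """Build the argument list for transcribing one media file.

    Assumption: the CLI accepts the media file as a positional argument.
    """
    return [executable, str(media_file)]


def transcribe(media_file: str, executable: str = "speechcatcher") -> Optional[int]:
    """Run the CLI on a media file.

    Returns the process exit code, or None if the executable
    cannot be found on PATH (for example, if it is not installed).
    """
    if shutil.which(executable) is None:
        return None
    if not Path(media_file).is_file():
        raise FileNotFoundError(media_file)
    # Pass the arguments as a list so that no shell quoting is needed.
    completed = subprocess.run(build_command(media_file, executable), check=False)
    return completed.returncode
```

Calling `transcribe("talk.mp4")` would then run the tool on a local file named `talk.mp4` (a placeholder name) and report its exit code.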
Scripts to replicate the data gathering can be found in speechcatcher-data. There are also instructions on how to replicate the training procedure with ESPnet.
Speechcatcher also comes with an easy-to-use web GUI. It supports multiple ASR engines: speechcatcher (CPU), subtitle2go (CPU), or whisper (GPU).
By targeting a single language per model, Speechcatcher models aim to be much faster than multilingual single-model transcription systems such as Whisper.
See our results here.
Currently, the focus is on transcribing German speech. More languages might be added later. If you would like to help expand Speechcatcher, please get in touch!
If you use speechcatcher models in your research, for now just cite this repository:
@misc{milde2023speechcatcher,
author = {Milde, Benjamin},
title = {Speechcatcher},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/speechcatcher-asr/speechcatcher}},
}
Speechcatcher is graciously funded by
Media Tech Lab by Media Lab Bayern (@media-tech-lab)