From 2ed6b07ed76e6eafb2ddc294c47446e7660f6b0e Mon Sep 17 00:00:00 2001
From: Alain Riou
Date: Tue, 3 Oct 2023 17:06:18 +0200
Subject: [PATCH 1/4] update readme

---
 README.md | 51 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 46 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 68a6605..f784f9c 100644
--- a/README.md
+++ b/README.md
@@ -142,19 +142,60 @@ for x, sr in ...:
 Note that when passing a list of files to `pesto.predict_from_files(...)` or the CLI directly, the model is loaded only once so you don't have to bother with that in general.
-## Benchmark
+## Performances
 On [MIR-1K]() and [MDB-stem-synth](), PESTO outperforms other self-supervised baselines.
-Its performances are close to CREPE's ones, that has 800x more parameters and was trained in a supervised way on a huge dataset containing MIR-1K and MDB-stem-synth, among others.
+Its performances are close to CREPE's ones, that has 800x more parameters and was trained in a supervised way on a huge
+dataset containing MIR-1K and MDB-stem-synth, among others.

-## Speed
+## Speed benchmark
-TODO
+PESTO is a very lightweight model, and is therefore very fast at inference time.
+As CQT frames are processed independently, the actual speed of the pitch estimation process mostly depends on the
+granularity of the predictions, that can be controlled with the `--step_size` parameter (10ms by default).
+
+Here is a comparison speed between CREPE and PESTO, averaged over 10 runs on the same machine.
+
+- Audio file: `wav` format, 2m51s
+- Hardware: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz, 8 cores
+
+Note that the *y*-axis is in log-scale: with a step size of 10ms (the default),
+PESTO would perform pitch estimation of the file in 13 seconds (~12 times faster than real-time) while CREPE would take 12 minutes!
+It is therefore more suited to applications that need very fast pitch estimation without relying on GPU resources.
 ## Cite
-If you want to cite this work,
\ No newline at end of file
+If you want to use this work, please cite:
+```
+TODO
+```
+
+## Credits
+
+- [multipitch-architectures](https://github.com/christofw/multipitch_architectures) for the original architecture of the model
+- [nnAudio](https://github.com/KinWaiCheuk/nnAudio) for the original CQT implementation
+
+```
+@ARTICLE{9174990,
+ author={K. W. {Cheuk} and H. {Anderson} and K. {Agres} and D. {Herremans}},
+ journal={IEEE Access},
+ title={nnAudio: An on-the-Fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks},
+ year={2020},
+ volume={8},
+ number={},
+ pages={161981-162003},
+ doi={10.1109/ACCESS.2020.3019084}}
+@ARTICLE{9174990,
+ author={K. W. {Cheuk} and H. {Anderson} and K. {Agres} and D. {Herremans}},
+ journal={IEEE Access},
+ title={nnAudio: An on-the-Fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks},
+ year={2020},
+ volume={8},
+ number={},
+ pages={161981-162003},
+ doi={10.1109/ACCESS.2020.3019084}}
+```
\ No newline at end of file

From 80207eec8364b1763c103ccf5e13de1608c28627 Mon Sep 17 00:00:00 2001
From: Alain Riou
Date: Tue, 3 Oct 2023 17:15:42 +0200
Subject: [PATCH 2/4] update readme

---
 README.md | 39 ++++++++++++++++++++++-----------------
 1 file changed, 22 insertions(+), 17 deletions(-)

diff --git a/README.md b/README.md
index f784f9c..cf4a42f 100644
--- a/README.md
+++ b/README.md
@@ -2,26 +2,25 @@
 **tl;dr**: Fast and powerful pitch estimator based on machine learning
+This code is the implementation of the [PESTO paper](https://arxiv.org/abs/2309.02265),
+that has been accepted at [ISMIR 2023](https://ismir2023.ismir.net/).
+
 **Disclaimer:** This repository contains minimal code and should be used for inference only. If you want full implementation details or want to use PESTO for research purposes, take a look at ~~[this repository](https://github.com/aRI0U/pesto-full)~~ (work in progress).
-
 ## Installation
 ```shell
 pip install pesto
 ```
-
-### Common issues
-
-- When
+That's it!
 ### Dependencies
 This repository is implemented in [PyTorch](https://pytorch.org/) and has the following additional dependencies:
-- `matplotlib` and `numpy` for basic I/O and plotting operations
+- `numpy` for basic I/O operations
 - [torchaudio](https://pytorch.org/audio/stable/) for audio loading
-- [nnAudio](https://github.com/KinWaiCheuk/nnAudio) for computing the Constant-Q Transform (CQT)
+- `matplotlib` for exporting pitch predictions as images (optional)
 ## Usage
@@ -171,13 +170,19 @@ It is therefore more suited to applications that need very fast pitch estimation
 If you want to use this work, please cite:
 ```
-TODO
+@inproceedings{PESTO,
+ author = {Riou, Alain and Lattner, Stefan and Hadjeres, Gaëtan and Peeters, Geoffroy},
+ booktitle = {Proceedings of the 24th International Society for Music Information Retrieval Conference, ISMIR 2023},
+ publisher = {International Society for Music Information Retrieval},
+ title = {PESTO: Pitch Estimation with Self-supervised Transposition-equivariant Objective},
+ year = {2023}
+}
 ```
 ## Credits
-- [multipitch-architectures](https://github.com/christofw/multipitch_architectures) for the original architecture of the model
 - [nnAudio](https://github.com/KinWaiCheuk/nnAudio) for the original CQT implementation
+- [multipitch-architectures](https://github.com/christofw/multipitch_architectures) for the original architecture of the model
 ```
 @ARTICLE{9174990,
@@ -189,13 +194,13 @@ TODO
  number={},
  pages={161981-162003},
  doi={10.1109/ACCESS.2020.3019084}}
-@ARTICLE{9174990,
- author={K. W. {Cheuk} and H. {Anderson} and K. {Agres} and D. {Herremans}},
- journal={IEEE Access},
- title={nnAudio: An on-the-Fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks},
- year={2020},
- volume={8},
+@ARTICLE{9865174,
+ author={Weiß, Christof and Peeters, Geoffroy},
+ journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
+ title={Comparing Deep Models and Evaluation Strategies for Multi-Pitch Estimation in Music Recordings},
+ year={2022},
+ volume={30},
  number={},
- pages={161981-162003},
- doi={10.1109/ACCESS.2020.3019084}}
+ pages={2814-2827},
+ doi={10.1109/TASLP.2022.3200547}}
 ```
\ No newline at end of file

From 3540b45d69144a1d76a10bfdd23ae185deca8ad7 Mon Sep 17 00:00:00 2001
From: Alain Riou <36546630+aRI0U@users.noreply.github.com>
Date: Tue, 3 Oct 2023 17:59:26 +0200
Subject: [PATCH 3/4] add images to README.md

---
 README.md | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index cf4a42f..4fb72fa 100644
--- a/README.md
+++ b/README.md
@@ -58,9 +58,8 @@ This structure is voluntarily the same as in [CREPE](https://github.com/marl/cre
 Alternatively, one can choose to save timesteps, pitch, confidence and activation outputs as a `.npz` file.
 Finally, you can also visualize the pitch predictions by exporting them as a `png` file. Here is an example:

- -

+![example f0](https://github.com/SonyCSLParis/pesto/assets/36546630/2ad82c86-136a-4125-bf47-ea1b93408022)
+
 Multiple formats can be specified after the `-e` option.
 #### Batch processing
@@ -82,9 +81,7 @@ Additionally, audio files can have any sampling rate, no resampling is required.
 By default, the model returns a probability distribution over all pitch bins.
 To convert it to a proper pitch, by default we use Argmax-Local Weighted Averaging as in CREPE:

- -

+![image](https://github.com/SonyCSLParis/pesto/assets/36546630/38a6f405-f591-4960-81d3-6fcc551d91e8)
 Alternatively, one can use basic argmax of weighted average with option `-r`/`--reduction`.
@@ -147,9 +144,8 @@ On [MIR-1K]() and [MDB-stem-synth](), PESTO outperforms other self-supervised ba
 Its performances are close to CREPE's ones, that has 800x more parameters and was trained in a supervised way on a huge
 dataset containing MIR-1K and MDB-stem-synth, among others.

- -

+![image](https://github.com/SonyCSLParis/pesto/assets/36546630/2fd0e46a-f9ac-4a7e-beb7-95b6f8f030fb)
+
 ## Speed benchmark
@@ -158,6 +154,7 @@ As CQT frames are processed independently, the actual speed of the pitch estimat
 granularity of the predictions, that can be controlled with the `--step_size` parameter (10ms by default).
 Here is a comparison speed between CREPE and PESTO, averaged over 10 runs on the same machine.
+![speed](https://github.com/SonyCSLParis/pesto/assets/36546630/c5ca72be-1c8a-4cbd-bc96-80fbe0d1096f)
 - Audio file: `wav` format, 2m51s
 - Hardware: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz, 8 cores
@@ -203,4 +200,4 @@ If you want to use this work, please cite:
  number={},
  pages={2814-2827},
  doi={10.1109/TASLP.2022.3200547}}
-```
\ No newline at end of file
+```

From e36cce3baab0d1a955d0e4151eedff6276a0ca9b Mon Sep 17 00:00:00 2001
From: Alain Riou <36546630+aRI0U@users.noreply.github.com>
Date: Tue, 3 Oct 2023 18:01:10 +0200
Subject: [PATCH 4/4] Update README.md

---
 README.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 4fb72fa..31f04b0 100644
--- a/README.md
+++ b/README.md
@@ -80,8 +80,7 @@ Additionally, audio files can have any sampling rate, no resampling is required.
 By default, the model returns a probability distribution over all pitch bins.
 To convert it to a proper pitch, by default we use Argmax-Local Weighted Averaging as in CREPE:
-
-![image](https://github.com/SonyCSLParis/pesto/assets/36546630/38a6f405-f591-4960-81d3-6fcc551d91e8)
+![image](https://github.com/SonyCSLParis/pesto/assets/36546630/7d06bf85-585c-401f-a3c2-f2fab90dd1a7)
 Alternatively, one can use basic argmax of weighted average with option `-r`/`--reduction`.
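
Note on the reduction step described in the README text above: the image that illustrates Argmax-Local Weighted Averaging is not reproduced in this patch, so here is a minimal sketch of the idea in plain Python. It is not PESTO's actual implementation. The function name, the `radius` parameter and its default of 4 bins on each side are assumptions made for illustration, and the real code operates on tensors rather than lists.

```python
def argmax_local_weighted_average(probs, bin_values, radius=4):
    """Reduce a per-bin probability distribution to a single pitch value.

    probs      -- probabilities (or activations), one per pitch bin
    bin_values -- pitch associated with each bin (e.g. in semitones)
    radius     -- bins kept on each side of the argmax (illustrative default)
    """
    if len(probs) != len(bin_values) or not probs:
        raise ValueError("probs and bin_values must be non-empty and the same length")

    # 1. Locate the most likely bin (first one in case of ties).
    center = max(range(len(probs)), key=probs.__getitem__)

    # 2. Keep only a small window around it, clipped to the valid range.
    start = max(0, center - radius)
    end = min(len(probs), center + radius + 1)

    # 3. Average the bin values in the window, weighted by their probabilities.
    total = sum(probs[start:end])
    if total == 0:
        # Degenerate distribution: fall back to the argmax bin itself.
        return float(bin_values[center])
    weighted = sum(p * v for p, v in zip(probs[start:end], bin_values[start:end]))
    return weighted / total


# Hypothetical 5-bin distribution: the peak is at bin 62 with some mass on bin 61.
probs = [0.0, 0.2, 0.6, 0.0, 0.0]
bins = [60.0, 61.0, 62.0, 63.0, 64.0]
print(argmax_local_weighted_average(probs, bins, radius=1))
```

With these inputs the estimate lands between bins 61 and 62, closer to 62. A plain argmax would return exactly 62, and a weighted average over all bins would also be influenced by probability mass far from the peak.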