
Pitch transformation #2

Open
pcournut opened this issue Jan 5, 2024 · 0 comments
pcournut commented Jan 5, 2024

Following a question I received by email, here is a small script to control pitch. It can be placed and run anywhere in the working directory, and it assumes you have downloaded the model checkpoints using the src/utilities/download_checkpoints.py script.

Pitch is extracted from the audio by the analyzer and can then be modified before being fed to the synthesizer. Two options are explored here, pitching down and flattening to a monotonic pitch, but other manipulations are possible. One could, for instance, use a transcriber to get precise timestamps for each word and then apply a specific pitch at each location to completely remodel the pitch sequence.

import os
import torch
import torchaudio
import pyrootutils
import torchaudio.transforms as T
from IPython.display import Audio, display


root = pyrootutils.setup_root(__file__, dotenv=True, pythonpath=True, cwd=False)

import src.dataclasses.backbone
from src.inference.backbone import BackboneInferencer


def synth(pitch, source_features, inferencer):
    with torch.no_grad():
        noise = None
        _, f0_corrected_synth = inferencer.generator.synthesize(
            pitch,
            source_features["p_amp"],
            source_features["ap_amp"],
            source_features["linguistic"],
            source_features.get("timbre_global", None),
            source_features.get("timbre_bank", None),
            noise=noise,
        )
        synth_audio = f0_corrected_synth.detach().cpu().squeeze(0).numpy()
        display(Audio(synth_audio, rate=inferencer.output_sr))
    return synth_audio


# Load inferencer
exp_dir = os.path.join(root, "static/runs/runs_backbone/hifitts/2023-09-29_16-22-28")
checkpoint_name = "opt-steps=step=400000.ckpt"
device = "cuda:0"
inferencer = BackboneInferencer(
    exp_dir=exp_dir, checkpoint_name=checkpoint_name, device=device
)

# Load audio
audio_path = os.path.join(root, "static/samples/vctk/p225_001.wav")
audio, source_sr = torchaudio.load(audio_path)
audio = T.Resample(source_sr, inferencer.input_sr)(audio)
print("Original:")
display(Audio(audio, rate=inferencer.input_sr))

# Extract pitch
with torch.no_grad():
    source_features = inferencer.generator.analyze(
        audio.to(device), enable_information_perturbator=False
    )


# Two example manipulations: a constant downward shift, and a flattened
# (monotonic) contour that holds the first frame's pitch, lowered slightly.
pitch_down = source_features["pitch"] - 20
monotonic_pitch = (
    torch.ones_like(source_features["pitch"]) * source_features["pitch"][:, 0] - 10
)
print("Pitch down:")
_ = synth(pitch_down, source_features, inferencer)
print("Monotonic pitch:")
_ = synth(monotonic_pitch, source_features, inferencer)
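As a rough sketch of the word-level idea mentioned above: given word timestamps from any transcriber, one could offset the pitch contour frame by frame. The `remodel_pitch` helper, the `(start, end, offset)` segment format, and the `frame_rate` argument are all hypothetical, not part of this repository's API; the snippet only illustrates indexing the analyzer's pitch tensor by time.

```python
import torch


def remodel_pitch(pitch: torch.Tensor, segments, frame_rate: float) -> torch.Tensor:
    """Apply a per-segment pitch offset.

    pitch: (batch, frames) tensor, e.g. source_features["pitch"].
    segments: list of (start_seconds, end_seconds, offset) tuples,
              e.g. derived from a transcriber's word timestamps.
    frame_rate: frames per second of the analyzer output (assumption).
    """
    new_pitch = pitch.clone()
    n_frames = pitch.shape[-1]
    for start_s, end_s, offset in segments:
        a = max(0, int(start_s * frame_rate))
        b = min(n_frames, int(end_s * frame_rate))
        new_pitch[..., a:b] += offset  # shift only the frames inside this word
    return new_pitch


# Toy example: raise the first "word", lower the second.
pitch = torch.zeros(1, 100)
out = remodel_pitch(pitch, [(0.0, 0.2, 15.0), (0.3, 0.5, -10.0)], frame_rate=100.0)
```

The result could then be passed to `synth` in place of `pitch_down` or `monotonic_pitch`.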

@pcournut pcournut self-assigned this Jan 5, 2024
@pcournut pcournut closed this as completed Jan 5, 2024
@pcournut pcournut reopened this Jan 8, 2024