
Pitch transformation #2

Open
pcournut opened this issue Jan 5, 2024 · 0 comments
pcournut commented Jan 5, 2024

Following a question I received by email, here is a small script to control pitch. It can be placed and run anywhere in the working directory, and it assumes you have downloaded the model checkpoints using the src/utilities/download_checkpoints.py script.

Pitch is extracted from the audio by the analyzer and can then be modified before being fed to the synthesizer. Two options are explored here, pitching down and flattening to a monotonic pitch, but other manipulations are possible. One could, for instance, use a transcriber to get precise timestamps for each word and then apply a specific pitch at each location to completely remodel the pitch sequence.

import os
import torch
import torchaudio
import pyrootutils
import torchaudio.transforms as T
from IPython.display import Audio, display


root = pyrootutils.setup_root(__file__, dotenv=True, pythonpath=True, cwd=False)

import src.dataclasses.backbone
from src.inference.backbone import BackboneInferencer


def synth(pitch, source_features, inferencer):
    with torch.no_grad():
        noise = None
        _, f0_corrected_synth = inferencer.generator.synthesize(
            pitch,
            source_features["p_amp"],
            source_features["ap_amp"],
            source_features["linguistic"],
            source_features.get("timbre_global", None),
            source_features.get("timbre_bank", None),
            noise=noise,
        )
        synth_audio = f0_corrected_synth.detach().cpu().squeeze(0).numpy()
        display(Audio(synth_audio, rate=inferencer.output_sr))
    return synth_audio


# Load inferencer
exp_dir = os.path.join(root, "static/runs/runs_backbone/hifitts/2023-09-29_16-22-28")
checkpoint_name = "opt-steps=step=400000.ckpt"
device = "cuda:0"
inferencer = BackboneInferencer(
    exp_dir=exp_dir, checkpoint_name=checkpoint_name, device=device
)

# Load audio
audio_path = os.path.join(root, "static/samples/vctk/p225_001.wav")
audio, source_sr = torchaudio.load(audio_path)
audio = T.Resample(source_sr, inferencer.input_sr)(audio)
print("Original:")
display(Audio(audio, rate=inferencer.input_sr))

# Extract pitch
with torch.no_grad():
    source_features = inferencer.generator.analyze(
        audio.to(device), enable_information_perturbator=False
    )


# Two example manipulations: a constant downward shift, and a flattened
# (monotonic) contour that holds the first frame's pitch, lowered slightly.
pitch_down = source_features["pitch"] - 20
monotonic_pitch = (
    torch.ones_like(source_features["pitch"]) * source_features["pitch"][:, 0] - 10
)
print("Pitch down:")
_ = synth(pitch_down, source_features, inferencer)
print("Monotonic pitch:")
_ = synth(monotonic_pitch, source_features, inferencer)
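As a rough sketch of the word-level idea mentioned above: given word timestamps from any transcriber, one could offset the pitch contour frame by frame. The `remodel_pitch` helper, the `(start, end, offset)` segment format, and the `frame_rate` argument are all hypothetical, not part of this repository's API; the snippet only illustrates indexing the analyzer's pitch tensor by time.

```python
import torch


def remodel_pitch(pitch: torch.Tensor, segments, frame_rate: float) -> torch.Tensor:
    """Apply a per-segment pitch offset.

    pitch: (batch, frames) tensor, e.g. source_features["pitch"].
    segments: list of (start_seconds, end_seconds, offset) tuples,
              e.g. derived from a transcriber's word timestamps.
    frame_rate: frames per second of the analyzer output (assumption).
    """
    new_pitch = pitch.clone()
    n_frames = pitch.shape[-1]
    for start_s, end_s, offset in segments:
        a = max(0, int(start_s * frame_rate))
        b = min(n_frames, int(end_s * frame_rate))
        new_pitch[..., a:b] += offset  # shift only the frames inside this word
    return new_pitch


# Toy example: raise the first "word", lower the second.
pitch = torch.zeros(1, 100)
out = remodel_pitch(pitch, [(0.0, 0.2, 15.0), (0.3, 0.5, -10.0)], frame_rate=100.0)
```

The result could then be passed to `synth` in place of `pitch_down` or `monotonic_pitch`.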

@pcournut pcournut self-assigned this Jan 5, 2024
@pcournut pcournut closed this as completed Jan 5, 2024
@pcournut pcournut reopened this Jan 8, 2024