Vague output for audio #25

Open
lixinghe1999 opened this issue Jul 13, 2024 · 3 comments

Comments

@lixinghe1999

I slightly modified the eval code for audio to run on my dataset; however, the outputs are vague even when the audio is speech.
They are all like the ones below:

  1. A device is beeping and it gets louder and louder.
  2. A machine is running and making a high pitched sound.
  3. A machine is running and then stops suddenly.

I attach my code below:

import torch  # conv_templates and make_audio_features come from the OneLLM eval code

def inference_onellm(model, target_dtype, images, modal=['image']):
    # Choose a prompt per modality; `images` holds the batched input features for whichever modality is used.
    if 'imu' in modal:
        inps = ['Describe the motion.'] * len(images)
    if 'audio' in modal:
        inps = ['Provide a one-sentence caption for the provided audio.'] * len(images)
        # inps = ['Provide a one-sentence action description for the provided audio.'] * len(images)
    if 'image' in modal:
        inps = ['Describe the scene.'] * len(images)
    images = images.cuda().to(target_dtype)

    # Wrap each prompt in the v1 conversation template.
    prompts = []
    for inp in inps:
        conv = conv_templates["v1"].copy()
        conv.append_message(conv.roles[0], inp)
        conv.append_message(conv.roles[1], None)
        prompts.append(conv.get_prompt())

    with torch.cuda.amp.autocast(dtype=target_dtype):
        responses = model.generate(prompts, images, 128, temperature=0.1, top_p=0.75, modal=modal)
        outputs = []
        for response, prompt in zip(responses, prompts):
            # Strip the echoed prompt and keep only the model's answer.
            response = response[len(prompt):].split('###')[0]
            response = response.strip()
            outputs.append(response)
    return outputs

# Build 128-bin mel features for a single audio file and run audio captioning.
audio = torch.tensor(make_audio_features('tmp_onellm.wav', mel_bins=128).transpose(0, 1)[None, None])
result_audio = inference_onellm(model, target_dtype, audio, modal=['audio'])

@csuhan
Owner

csuhan commented Jul 13, 2024

Hi @lixinghe1999, our model is mainly trained on natural sounds like bird chirping, dog barking, and trains passing, so it is hard for it to handle human speech. Here are two ways to enhance it:

  • Stage II: multimodal-text alignment on speech-text data. However, this requires joint training with the other modalities.
  • Add a pretrained speech encoder (e.g. Whisper) to extract speech information; see the sketch below. You can refer to https://github.com/QwenLM/Qwen-Audio
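
A minimal sketch of the second option, assuming the Hugging Face transformers implementation of Whisper (this is not part of OneLLM, and the extracted features would still need a trained projection/adapter into the LLM):

import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# Hypothetical helper, not part of OneLLM: encode 16 kHz speech with a pretrained Whisper encoder.
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
whisper_encoder = WhisperModel.from_pretrained("openai/whisper-base").encoder.eval()

def extract_speech_features(waveform, sampling_rate=16000):
    # waveform: 1-D float array/tensor of raw audio samples at 16 kHz
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        # (1, num_frames, d_model) hidden states that could be projected into the LLM's embedding space
        return whisper_encoder(inputs.input_features).last_hidden_state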

@lixinghe1999
Author

lixinghe1999 commented Jul 14, 2024

Thank you for your rapid reply. However, it still outputs meaningless results for other sounds, such as musical instruments. Can you give me some hints on how to fix this? I believe retraining should not be necessary.

Could the audio duration be the issue? Since the IMU duration is fixed to 2 seconds, I also fixed the audio duration to 2 seconds.

@csuhan
Owner

csuhan commented Jul 16, 2024

It may also be related to the sampling length. We sample 1024 frames in total.

p = target_length - n_frames
if p > 0:
    m = torch.nn.ZeroPad2d((0, 0, 0, p))
    fbank = m(fbank)
elif p < 0:
    fbank = fbank[0:target_length, :]
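
For reference, a rough sketch of how that padding interacts with a 2-second clip, assuming the fbank frames come from torchaudio's Kaldi fbank with a 10 ms frame shift (the function name fbank_1024 is hypothetical): a 2 s clip yields only about 200 frames, so padding to target_length = 1024 leaves roughly 80% of the encoder input as zeros, whereas a ~10 s clip would fill it.

import torch
import torchaudio

def fbank_1024(wav_path, target_length=1024, mel_bins=128):
    waveform, sr = torchaudio.load(wav_path)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
        window_type='hanning', num_mel_bins=mel_bins, dither=0.0, frame_shift=10)
    n_frames = fbank.shape[0]          # a 2 s clip gives ~200 frames, far below 1024
    p = target_length - n_frames
    if p > 0:                          # short clip: most of the 1024 frames become zero padding
        fbank = torch.nn.ZeroPad2d((0, 0, 0, p))(fbank)
    elif p < 0:                        # long clip: crop to the first 1024 frames (~10.24 s)
        fbank = fbank[0:target_length, :]
    return fbank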
