Vague output for audio #25

Open
lixinghe1999 opened this issue Jul 13, 2024 · 3 comments

Comments

@lixinghe1999

I slightly modified the eval code for audio to run on my dataset; however, the outputs are vague even when the audio is speech.
They are all like the ones below:

  1. A device is beeping and it gets louder and louder.
  2. A machine is running and making a high pitched sound.
  3. A machine is running and then stops suddenly.

I attach my code below:

import torch  # conv_templates and make_audio_features come from the OneLLM eval code

def inference_onellm(model, target_dtype, images, modal=['image']):
    # Choose a prompt per modality; `images` holds the batched input features for whichever modality is used.
    if 'imu' in modal:
        inps = ['Describe the motion.'] * len(images)
    if 'audio' in modal:
        inps = ['Provide a one-sentence caption for the provided audio.'] * len(images)
        # inps = ['Provide a one-sentence action description for the provided audio.'] * len(images)
    if 'image' in modal:
        inps = ['Describe the scene.'] * len(images)
    images = images.cuda().to(target_dtype)

    # Wrap each prompt in the v1 conversation template.
    prompts = []
    for inp in inps:
        conv = conv_templates["v1"].copy()
        conv.append_message(conv.roles[0], inp)
        conv.append_message(conv.roles[1], None)
        prompts.append(conv.get_prompt())

    with torch.cuda.amp.autocast(dtype=target_dtype):
        responses = model.generate(prompts, images, 128, temperature=0.1, top_p=0.75, modal=modal)
        outputs = []
        for response, prompt in zip(responses, prompts):
            # Strip the echoed prompt and keep only the model's answer.
            response = response[len(prompt):].split('###')[0]
            response = response.strip()
            outputs.append(response)
    return outputs

# Build 128-bin mel features for a single audio file and run audio captioning.
audio = torch.tensor(make_audio_features('tmp_onellm.wav', mel_bins=128).transpose(0, 1)[None, None])
result_audio = inference_onellm(model, target_dtype, audio, modal=['audio'])

@csuhan
Owner

csuhan commented Jul 13, 2024

Hi @lixinghe1999, our model is mainly trained on natural sounds like bird chirping, dog barking, and trains passing, so it is hard for it to handle human speech. Here are two ways to enhance it:

  • Stage II: multimodal-text alignment on speech-text data. However, this requires joint training with the other modalities.
  • Add a pretrained speech encoder (e.g. Whisper) to extract speech information; see the sketch below. You can refer to https://github.com/QwenLM/Qwen-Audio
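
A minimal sketch of the second option, assuming the Hugging Face transformers implementation of Whisper (this is not part of OneLLM, and the extracted features would still need a trained projection/adapter into the LLM):

import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# Hypothetical helper, not part of OneLLM: encode 16 kHz speech with a pretrained Whisper encoder.
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
whisper_encoder = WhisperModel.from_pretrained("openai/whisper-base").encoder.eval()

def extract_speech_features(waveform, sampling_rate=16000):
    # waveform: 1-D float array/tensor of raw audio samples at 16 kHz
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        # (1, num_frames, d_model) hidden states that could be projected into the LLM's embedding space
        return whisper_encoder(inputs.input_features).last_hidden_state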

@lixinghe1999
Author

lixinghe1999 commented Jul 14, 2024

Thank you for your rapid reply. However, it still outputs meaningless results for other sounds, such as musical instruments. Can you give me some hints on how to fix this? I believe retraining should not be necessary.

Could the audio duration be the issue? Since the IMU duration is fixed to 2 seconds, I also fixed the audio duration to 2 seconds.

@csuhan
Owner

csuhan commented Jul 16, 2024

It may also be related to the sampling length. We sample 1024 frames in total.

p = target_length - n_frames
if p > 0:
    m = torch.nn.ZeroPad2d((0, 0, 0, p))
    fbank = m(fbank)
elif p < 0:
    fbank = fbank[0:target_length, :]
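
For reference, a rough sketch of how that padding interacts with a 2-second clip, assuming the fbank frames come from torchaudio's Kaldi fbank with a 10 ms frame shift (the function name fbank_1024 is hypothetical): a 2 s clip yields only about 200 frames, so padding to target_length = 1024 leaves roughly 80% of the encoder input as zeros, whereas a ~10 s clip would fill it.

import torch
import torchaudio

def fbank_1024(wav_path, target_length=1024, mel_bins=128):
    waveform, sr = torchaudio.load(wav_path)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
        window_type='hanning', num_mel_bins=mel_bins, dither=0.0, frame_shift=10)
    n_frames = fbank.shape[0]          # a 2 s clip gives ~200 frames, far below 1024
    p = target_length - n_frames
    if p > 0:                          # short clip: most of the 1024 frames become zero padding
        fbank = torch.nn.ZeroPad2d((0, 0, 0, p))(fbank)
    elif p < 0:                        # long clip: crop to the first 1024 frames (~10.24 s)
        fbank = fbank[0:target_length, :]
    return fbank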
