Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add adapter for HiSanta data #47

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Conversation

mdepinet
Copy link

No description provided.

@mdepinet mdepinet self-assigned this Jun 26, 2024
@mdepinet
Copy link
Author

@farzadab It wasn't clear to me whether VoiceDatasetArgs are optional customizations to be used by some datasets or whether there are some that Datasets are required to respect. (I imagine at least max_audio_duration_secs is required?) Should be pretty easy to add support for the required ones once I know which those are.

Note to self: Need to set up new service account.

@farzadab
Copy link
Contributor

I believe include_audio, shuffle, max_audio_duration_secs, and split should be respected. The other args can be situational.

Comment on lines +705 to +706
"""List of references to conversation metadata JSON files in the bucket.
These all look like {conversation_id}/metadata.json."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Why use strings for comments?

f"{conversation_id}/{message['speech']}"
).download_as_bytes()
yield VoiceSample(
messages=[*history, {"role": "user", "content": "<|audio|>"}],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current assumption is that the last message should be the assistant message.

Comment on lines +719 to +720
for i in range(start, len(self._conversations), increment):
yield from self._from_conversation(i)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll probably have to experiment with how to form our samples here.
There are multiple issues to consider:

  1. How to do shuffle
  2. The length of each sample should be regulated: max_audio_duration_secs was an attempt at this, but generally the bottleneck is GPU memory

@mdepinet
Copy link
Author

Putting this on ice for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants