Add adapter for HiSanta data #47

mdepinet · 2024-06-26T00:23:58Z

No description provided.

mdepinet · 2024-06-26T00:27:00Z

@farzadab It wasn't clear to me whether VoiceDatasetArgs are optional customizations that only some datasets use, or whether every dataset is required to respect some of them. (I imagine at least max_audio_duration_secs is required?) It should be pretty easy to add support for the required ones once I know which those are.

Note to self: Need to set up new service account.

farzadab · 2024-06-26T16:20:15Z

I believe include_audio, shuffle, max_audio_duration_secs, and split should be respected. The other args can be situational.
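For illustration only, here is a rough sketch of what respecting those four args could look like. `SketchArgs` is a hypothetical stand-in for VoiceDatasetArgs, and the row layout (`split`, `audio_secs`, `audio`) and the `shuffle_seed` field are assumptions rather than anything taken from the repo.

```python
import dataclasses
import random
from typing import Any, Dict, Iterator, List, Optional


@dataclasses.dataclass
class SketchArgs:
    """Hypothetical stand-in for VoiceDatasetArgs (only the four args named above)."""

    include_audio: bool = True
    shuffle: bool = False
    max_audio_duration_secs: Optional[float] = None
    split: str = "train"
    shuffle_seed: int = 42  # assumption: a fixed seed keeps the shuffle reproducible


def iter_samples(
    rows: List[Dict[str, Any]], args: SketchArgs
) -> Iterator[Dict[str, Any]]:
    """Yield rows from the requested split, honoring the four args."""
    # split: keep only rows that belong to the requested split.
    selected = [row for row in rows if row["split"] == args.split]

    # shuffle: reorder a copy so the caller's list is left untouched.
    if args.shuffle:
        random.Random(args.shuffle_seed).shuffle(selected)

    for row in selected:
        # max_audio_duration_secs: skip samples whose audio is too long.
        limit = args.max_audio_duration_secs
        if limit is not None and row["audio_secs"] > limit:
            continue

        sample = dict(row)  # shallow copy, so dropping audio does not mutate the input
        # include_audio: drop the audio payload for text-only use.
        if not args.include_audio:
            sample["audio"] = None
        yield sample
```

Whether over-long samples should be skipped or truncated is a design choice; this sketch skips them.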

farzadab · 2024-06-26T16:23:03Z

ultravox/data/datasets.py

+        """List of references to conversation metadata JSON files in the bucket.
+        These all look like {conversation_id}/metadata.json."""


Nit: Why use strings for comments?

farzadab · 2024-06-26T16:24:40Z

ultravox/data/datasets.py

+                    f"{conversation_id}/{message['speech']}"
+                ).download_as_bytes()
+                yield VoiceSample(
+                    messages=[*history, {"role": "user", "content": "<|audio|>"}],


Current assumption is that the last message should be the assistant message.
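A minimal sketch of that assumption, for illustration: each sample ends with the assistant reply that follows the audio turn. The turn layout (a list of `{"role", "content"}` dicts) and the function name are hypothetical; only the `<|audio|>` placeholder comes from the diff above.

```python
from typing import Dict, Iterator, List

AUDIO_PLACEHOLDER = "<|audio|>"  # placeholder string used in the diff above


def samples_ending_with_assistant(
    turns: List[Dict[str, str]],
) -> Iterator[List[Dict[str, str]]]:
    """Yield one message list per user turn that is directly followed by an assistant reply.

    The user turn is replaced by the audio placeholder, and the assistant
    reply is kept as the last message of the sample.
    """
    history: List[Dict[str, str]] = []
    for current, following in zip(turns, turns[1:]):
        if current["role"] == "user" and following["role"] == "assistant":
            yield [
                *history,
                {"role": "user", "content": AUDIO_PLACEHOLDER},
                dict(following),
            ]
        # Earlier turns stay in the history as plain text.
        history.append(dict(current))
```

A trailing user turn with no reply produces no sample under this scheme.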

farzadab · 2024-06-26T16:29:35Z

ultravox/data/datasets.py

+        for i in range(start, len(self._conversations), increment):
+            yield from self._from_conversation(i)


We'll probably have to experiment with how to form our samples here.
There are multiple issues to consider:

- How to do the shuffle.
- The length of each sample should be regulated: max_audio_duration_secs was an attempt at this, but generally the bottleneck is GPU memory.
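One possible way to handle the shuffle, sketched under assumptions: shuffle the conversation indices with a seed shared by all workers, then apply the same start/increment stride as the loop above. The function name and the `seed` parameter are hypothetical, and this addresses only the ordering, not the sample-length question.

```python
import random
from typing import List


def conversation_order(
    num_conversations: int,
    start: int,
    increment: int,
    shuffle: bool,
    seed: int = 0,
) -> List[int]:
    """Return the conversation indices one worker should visit, in order."""
    order = list(range(num_conversations))
    if shuffle:
        # Every worker must use the same seed so the strided slices stay disjoint.
        random.Random(seed).shuffle(order)
    # Same striding as range(start, len(...), increment), applied to the order.
    return order[start::increment]
```

This shuffles at the conversation level only, so samples drawn from one conversation would still be emitted back to back.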

mdepinet · 2024-06-26T18:05:36Z

Putting this on ice for now.

initial impl

ab57304

mdepinet self-assigned this Jun 26, 2024

farzadab reviewed Jun 26, 2024
