
Replacing weight with multiplier #105

Open
wants to merge 12 commits into main

Conversation

liPatrick (Contributor) commented Sep 10, 2024

To make running in epoch mode easier, we redefine a dataset's weight to be:

  1. the number of times the dataset is copied when weight > 1,
  2. the fraction of the dataset that is used when weight < 1,
  3. 1 by default for all datasets.

In the interleave class, I redefine the probability of selecting the next sample from each dataset using the dataset's length and weight to reflect the above (only in epoch mode).
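A minimal sketch of the sampling scheme described above (hypothetical names; the actual implementation lives in ultravox/data/datasets.py):

```python
# Sketch of weight-based sampling in epoch mode. All function names here
# are hypothetical, not the actual ultravox implementation.
def relative_frequencies(lengths, weights):
    # Effective size of each dataset: weight > 1 copies the dataset,
    # weight < 1 uses only that fraction of it, weight == 1 is a no-op.
    return [length * weight for length, weight in zip(lengths, weights)]

def selection_probabilities(lengths, weights):
    # Probability of drawing the next sample from each dataset is its
    # effective size divided by the total effective size.
    freqs = relative_frequencies(lengths, weights)
    total = sum(freqs)
    return [f / total for f in freqs]

# Two datasets: 1000 samples at weight 2 (copied twice) and
# 500 samples at weight 0.5 (half of it used).
probs = selection_probabilities([1000, 500], [2, 0.5])
# probs == [2000/2250, 250/2250]
```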

liPatrick (Contributor, Author) commented Sep 10, 2024

TODO: verify num_epochs # of steps matches what we expect.

liPatrick (Contributor, Author) commented Sep 10, 2024

TODO: verify num_epochs # of steps matches what we expect.

Set to 1 epoch with 3000 total samples, batch size of 24, on 8 GPUs.
We expect 16 steps because 3000 / (8 * 24) = 15.625, rounded up.
https://wandb.ai/fixie/ultravox/runs/widpk5aw/logs
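The expected step count above can be checked directly:

```python
import math

total_samples = 3000
batch_size = 24
num_gpus = 8

# Each optimizer step consumes batch_size samples on each of the GPUs.
samples_per_step = batch_size * num_gpus  # 192
steps = math.ceil(total_samples / samples_per_step)
print(steps)  # 16, since 3000 / 192 = 15.625 rounds up
```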

ultravox/data/datasets.py (outdated, resolved)
ultravox/data/datasets.py (outdated, resolved)
ultravox/training/configs/meta_config.yaml (resolved)
ultravox/training/train.py (outdated, resolved)
ultravox/data/datasets.py (outdated, resolved)
ultravox/data/dataset_config.py (outdated, resolved)
liPatrick (Contributor, Author) commented:

With this change, we can no longer use non-generic datasets (with or without epoch mode), because the multiplier requires a length associated with the dataset, which only generic datasets allow you to configure. I'll keep this on the sideline until #111 is merged in.

@liPatrick liPatrick changed the title Weights as dataset multiplier Replacing weight with multiplier Sep 12, 2024
@@ -1142,10 +1143,12 @@ def __init__(
self._static = static

self._stop_strategy = stop_strategy
relative_frequencies = [
Contributor:

I think life might be easier if we didn't try to read the multiplier value out of each dataset class, and instead read it from a config passed to the interleave class, similar to how this is done with probabilities in HF's interleave_datasets: https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.interleave_datasets

Contributor:

Then we wouldn't have to worry about the interaction with len, etc

Contributor (Author):

Then we'd need to calculate the probabilities by hand in the config? I think we'd need len either way, because that's how HF determines the number of steps to take in epoch mode.

Contributor:

Well, the weights are being set by hand already, right? It just seems a bit strange to have to read the weight/multiplier property from the dataset class, especially when the dataset class doesn't use it internally.
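For comparison, a rough sketch of the config-driven alternative being suggested, where the interleave class receives probabilities directly from the caller (as in HF's datasets.interleave_datasets) instead of reading a multiplier property off each dataset. All names here are hypothetical:

```python
import itertools
import random

def interleave(datasets, probabilities, num_samples, seed=42):
    # Pick the source dataset for each output sample according to fixed
    # probabilities supplied by the caller; the datasets themselves stay
    # unaware of any weighting. Each dataset is cycled so it never runs dry.
    rng = random.Random(seed)
    iterators = [itertools.cycle(ds) for ds in datasets]
    out = []
    for _ in range(num_samples):
        idx = rng.choices(range(len(datasets)), weights=probabilities)[0]
        out.append(next(iterators[idx]))
    return out

# Draw 10 samples, favoring the first dataset 70/30.
mixed = interleave([["a"] * 3, ["b"] * 3], probabilities=[0.7, 0.3],
                   num_samples=10)
```

With this shape, the weights live entirely in the training config, and the dataset classes carry no state they don't use themselves.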

4 participants