
Replacing weight with multiplier #105

Open
wants to merge 12 commits into main

Conversation

liPatrick (Contributor) commented Sep 10, 2024

To make running in epoch mode easier, we redefine a dataset's weight to be:

  1. the number of times the dataset is copied when weight > 1,
  2. the fraction of the dataset that is used when weight < 1,
  3. 1 by default for all datasets.

In the interleave class, I redefine the probability of selecting the next sample from each dataset using the dataset's length and weight to reflect the above (only in epoch mode).
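A minimal sketch of the sampling scheme described above (hypothetical names; the actual implementation lives in ultravox/data/datasets.py):

```python
# Sketch of weight-based sampling in epoch mode. All function names here
# are hypothetical, not the actual ultravox implementation.
def relative_frequencies(lengths, weights):
    # Effective size of each dataset: weight > 1 copies the dataset,
    # weight < 1 uses only that fraction of it, weight == 1 is a no-op.
    return [length * weight for length, weight in zip(lengths, weights)]

def selection_probabilities(lengths, weights):
    # Probability of drawing the next sample from each dataset is its
    # effective size divided by the total effective size.
    freqs = relative_frequencies(lengths, weights)
    total = sum(freqs)
    return [f / total for f in freqs]

# Two datasets: 1000 samples at weight 2 (copied twice) and
# 500 samples at weight 0.5 (half of it used).
probs = selection_probabilities([1000, 500], [2, 0.5])
# probs == [2000/2250, 250/2250]
```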

liPatrick (Contributor, Author) commented Sep 10, 2024

TODO: verify num_epochs # of steps matches what we expect.

liPatrick (Contributor, Author) commented Sep 10, 2024

TODO: verify num_epochs # of steps matches what we expect.

Set to 1 epoch with 3000 total samples, batch size of 24, on 8 GPUs.
We expect 16 steps because 3000 / (8 * 24) = 15.625, rounded up.
https://wandb.ai/fixie/ultravox/runs/widpk5aw/logs
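The expected step count above can be checked directly:

```python
import math

total_samples = 3000
batch_size = 24
num_gpus = 8

# Each optimizer step consumes batch_size samples on each of the GPUs.
samples_per_step = batch_size * num_gpus  # 192
steps = math.ceil(total_samples / samples_per_step)
print(steps)  # 16, since 3000 / 192 = 15.625 rounds up
```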

ultravox/data/datasets.py (outdated, resolved)
ultravox/data/datasets.py (outdated, resolved)
ultravox/training/configs/meta_config.yaml (resolved)
ultravox/training/train.py (outdated, resolved)
ultravox/data/datasets.py (outdated, resolved)
ultravox/data/dataset_config.py (outdated, resolved)
liPatrick (Contributor, Author) commented:

With this change, we can no longer use non-generic datasets (with or without epoch mode), because the multiplier requires a length associated with the dataset, which only generic datasets allow you to configure. I'll keep this on the sideline until #111 is merged in.

@liPatrick liPatrick changed the title Weights as dataset multiplier Replacing weight with multiplier Sep 12, 2024
@@ -1142,10 +1143,12 @@ def __init__(
self._static = static

self._stop_strategy = stop_strategy
relative_frequencies = [
Contributor:

I think life might be easier if we didn't try to read the multiplier value out of each dataset class, and instead read it from a config passed to the interleave class, similar to how this is done with probabilities in HF's interleave_datasets: https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.interleave_datasets

Contributor:

Then we wouldn't have to worry about the interaction with len, etc

Contributor (Author):

Then we'd need to calculate the probabilities by hand in the config? I think we'd need len either way, because that's how HF determines the number of steps to take in epoch mode.

Contributor:

Well, the weights are being set by hand already, right? It just seems a bit strange to have to read the weight/multiplier property from the dataset class, especially when the dataset class doesn't use it internally.
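For comparison, a rough sketch of the config-driven alternative being suggested, where the interleave class receives probabilities directly from the caller (as in HF's datasets.interleave_datasets) instead of reading a multiplier property off each dataset. All names here are hypothetical:

```python
import itertools
import random

def interleave(datasets, probabilities, num_samples, seed=42):
    # Pick the source dataset for each output sample according to fixed
    # probabilities supplied by the caller; the datasets themselves stay
    # unaware of any weighting. Each dataset is cycled so it never runs dry.
    rng = random.Random(seed)
    iterators = [itertools.cycle(ds) for ds in datasets]
    out = []
    for _ in range(num_samples):
        idx = rng.choices(range(len(datasets)), weights=probabilities)[0]
        out.append(next(iterators[idx]))
    return out

# Draw 10 samples, favoring the first dataset 70/30.
mixed = interleave([["a"] * 3, ["b"] * 3], probabilities=[0.7, 0.3],
                   num_samples=10)
```

With this shape, the weights live entirely in the training config, and the dataset classes carry no state they don't use themselves.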

4 participants