
[Question] rehearsal example clarification regarding train/valid splits #289

Open

kirk86 opened this issue Nov 14, 2023 · 2 comments

kirk86 commented Nov 14, 2023

In the rehearsal example there's the following template:

# Imports needed by this template.
from torch.utils.data import DataLoader

from continuum import ClassIncremental, rehearsal
from continuum.datasets import CIFAR100

scenario = ClassIncremental(
    CIFAR100(data_path="my/data/path", download=True, train=True),
    increment=10,
    initial_increment=50
)

memory = rehearsal.RehearsalMemory(
    memory_size=2000,
    herding_method="barycenter"
)

for task_id, taskset in enumerate(scenario):
    if task_id > 0:
        mem_x, mem_y, mem_t = memory.get()
        taskset.add_samples(mem_x, mem_y, mem_t)

    loader = DataLoader(taskset, shuffle=True)
    for epoch in range(epochs):
        for x, y, t in loader:
            # Do your training here
            ...

    # Herding based on the barycenter (as iCaRL did) needs features,
    # so we need to extract those features, but beware to use a loader
    # without shuffling.
    loader = DataLoader(taskset, shuffle=False)

    features = my_function_to_extract_features(my_model, loader)

    # Important! Draw the raw samples from `scenario[task_id]` to
    # re-generate the taskset, otherwise you'd risk sampling from both new
    # data and memory data which is probably not what you want to do.
    memory.add(*scenario[task_id].get_raw_samples(), features)

How does that change if we add train/valid splits?

import torch.utils.data as tud

from continuum import tasks

for task_id, taskset in enumerate(scenario):
    if task_id > 0:
        mem_x, mem_y, mem_t = memory.get()
        taskset.add_samples(mem_x, mem_y, mem_t)

    dataset_train, dataset_val = tasks.split_train_val(taskset, val_split=0.1)
    train_loader = tud.DataLoader(dataset_train, shuffle=True)
    val_loader = tud.DataLoader(dataset_val, shuffle=True)
    
    for epoch in range(epochs):
        for x, y, t in train_loader:
            # Do your training here
            ...

    # Herding based on the barycenter (as iCaRL did) needs features,
    # so we need to extract those features, but beware to use a loader
    # without shuffling.
    unshuffled_loader = tud.DataLoader(taskset, shuffle=False)  # --> here, should it be taskset or dataset_train?

    features = my_function_to_extract_features(my_model, unshuffled_loader)

    # Important! Draw the raw samples from `scenario[task_id]` to
    # re-generate the taskset, otherwise you'd risk sampling from both new
    # data and memory data which is probably not what you want to do.
    memory.add(*scenario[task_id].get_raw_samples(), features) # --> scenario[task_id].get_raw_samples() returns all samples in the current taskset?

If `unshuffled_loader` uses `dataset_train`, then `len(features)` $\neq$ `len(scenario[task_id].get_raw_samples()[0])`.
Another question is: do we need to add samples to the memory buffer from both the train and valid splits, or just the train split? Because, in my understanding, the taskset contains all samples before the train/valid split, right?

TLESORT (Collaborator) commented Nov 20, 2023

Hi @kirk86,
Thanks for the issue.

> --> here should it be taskset or dataset_train?

I would do it based on `dataset_val` to avoid overfitting when sampling, but I believe it is a matter of choice.
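Concretely, something like the sketch below, assuming the splits returned by `split_train_val` are themselves `TaskSet`s exposing `get_raw_samples`, so that raw samples and features stay aligned:

# Hypothetical sketch: herding features extracted from the validation
# split, and raw samples drawn from that same split, so lengths match.
unshuffled_loader = DataLoader(dataset_val, shuffle=False)
features = my_function_to_extract_features(my_model, unshuffled_loader)
memory.add(*dataset_val.get_raw_samples(), features)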

> --> scenario[task_id].get_raw_samples() returns all samples in the current taskset?

Yes.

> If unshuffled_loader uses dataset_train then len(features) $\neq$ len(scenario[task_id].get_raw_samples()[0]).

True. Is that a problem?

> Another question is: do we need to add samples to the memory buffer from both the train and valid splits, or just the train split?

Usually we would/should use a separate memory buffer for validation, with data from all tasks seen so far, or split the taskset before adding samples and add samples only to the train split. Otherwise, some samples might be in train for one task and later in val.

The taskset contains all the samples of the current task.
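For the first option, a rough sketch of where a separate validation buffer could fit (the second `RehearsalMemory` instance and its size are illustrative, and `features_of_new_train` / `features_of_new_val` stand for features extracted as in the original template, over the same samples in the same order):

val_memory = rehearsal.RehearsalMemory(
    memory_size=200,  # illustrative size, pick what fits your budget
    herding_method="barycenter"
)

for task_id, taskset in enumerate(scenario):
    # Split first, so a sample can never be in train for one task
    # and in val for a later one.
    dataset_train, dataset_val = tasks.split_train_val(taskset, val_split=0.1)

    # Stash the raw *new* samples of each split before mixing in memory,
    # so the buffers never re-ingest old memory data.
    new_train = dataset_train.get_raw_samples()
    new_val = dataset_val.get_raw_samples()

    if task_id > 0:
        dataset_train.add_samples(*memory.get())      # rehearsal for training
        dataset_val.add_samples(*val_memory.get())    # validate on all tasks seen so far

    # ... train on dataset_train, validate on dataset_val as before ...

    # Fill each buffer from its own split's new samples.
    memory.add(*new_train, features_of_new_train)
    val_memory.add(*new_val, features_of_new_val)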

I hope those answers help :)

kirk86 (Author) commented Nov 20, 2023

Hi @TLESORT,
Thanks for the reply.

>> If unshuffled_loader uses dataset_train then len(features) $\neq$ len(scenario[task_id].get_raw_samples()[0]).
>
> True. Is that a problem?

I think so, because `memory.add(arg1, arg2)` throws an error when the two arguments do not have an equal number of samples.
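For example, a quick sanity check (hypothetical, assuming samples and features come from the same split in the same order) makes the constraint explicit:

# All of x, y, t and features must have the same length, since
# herding pairs each raw sample with its extracted feature.
x, y, t = dataset_val.get_raw_samples()
assert len(features) == len(x) == len(y) == len(t)
memory.add(x, y, t, features)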

>> Another question is: do we need to add samples to the memory buffer from both the train and valid splits, or just the train split?
>
> Usually we would/should use a separate memory buffer for validation, with data from all tasks seen so far, or split the taskset before adding samples and add samples only to the train split. Otherwise, some samples might be in train for one task and later in val.

I didn't quite get the first part about using the separate memory buffer for validation. I'll apply your suggested changes to the MWE below; let me know if there's a misunderstanding on my part.

for task_id, taskset in enumerate(scenario):
    dataset_train, dataset_val = tasks.split_train_val(taskset, val_split=0.1)

    if task_id > 0:  # --> as suggested, first split the taskset, then add samples to dataset_train only
        mem_x, mem_y, mem_t = memory.get()
        dataset_train.add_samples(mem_x, mem_y, mem_t)

    train_loader = tud.DataLoader(dataset_train, shuffle=True)
    val_loader = tud.DataLoader(dataset_val, shuffle=True)
    
    for epoch in range(epochs):
        for x, y, t in train_loader:
            # Do your training here
            ...

    # beware use a loader without shuffling.
    unshuffled_loader = tud.DataLoader(dataset_val, shuffle=False)  # --> as suggested, dataset_val to avoid overfitting?

    features = my_function_to_extract_features(my_model, unshuffled_loader)

    # Important! Draw the raw samples from `scenario[task_id]` to
    # re-generate the taskset, otherwise you'd risk sampling from both new
    # data and memory data which is probably not what you want to do.
    memory.add(*scenario[task_id].get_raw_samples(), features)  # --> how should this change? Should it be dataset_train.get_raw_samples()?

Could you also illustrate what you meant by "usually we would/should use a separate memory buffer for validation with data from all tasks seen so far", and where exactly it fits in the MWE?
