
create _prepare_fsdp to pre-prepare FSDP model training #3213

Open
wants to merge 9 commits into main

Conversation

eljandoubi (Contributor)

Fixes #3209

Who can review?

eljandoubi changed the title from "creeate _preprare_fsdp to pre- prepare fsdp model training" to "create _preprare_fsdp to pre- prepare fsdp model training" on Nov 7, 2024
eljandoubi (Contributor, Author)

@muellerzr, do you have any feedback?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

BenjaminBossan (Member) left a comment

Thanks for working on this important issue.

I wonder about the overall approach to solving this issue. Right now, IIUC, the idea is to re-initialize the optimizer(s) using the FSDP-wrapped models. This could be error-prone and should ideally be well tested before we merge.

Would it be possible instead to check whether this has happened and raise an error, directing users not to initialize the optimizer prematurely, and to handle the initialization within accelerate so that it happens in the right order? Maybe that can't work, but I think it's worth considering different solutions.

Also, if this is really the same issue as the one reported in PEFT, it means that it used to work correctly in previous versions of accelerate/transformers. In that case, I wonder what change caused the issue.

Comment on lines +2094 to +2105
# Validate the presence of models and optimizers
if not models and not optimizers:
    return args

# Flattening weights implies that the optimizers have already been processed.
if next(next(iter(models.values())).named_parameters())[0].endswith("_flat_param"):
    return args

if len(models) != len(optimizers):
    raise ValueError(
        f"The number of models ({len(models)}) must be equal to the number of optimizers ({len(optimizers)})."
    )
BenjaminBossan (Member):

Would it make sense to move these checks to the very start of the method?

eljandoubi (Contributor, Author):

@BenjaminBossan Which method do you mean, .prepare? What would be the benefits of doing so?

BenjaminBossan (Member):

I meant _prepare_fsdp. It is a common pattern to perform all checks as early as possible, so following this makes the code easier to understand for readers. This is especially so if there are early returns.

The reason it's common is that we avoid doing unnecessary work if the checks fail anyway. In this case, there is no need to determine models or optimizers if we're going to raise an error later. By skipping the unnecessary work, we ensure faster execution and prevent possibly unwanted side effects (this might not be relevant here right now, but the code will change in the future and then it could become relevant).
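
For illustration, a minimal sketch of the early-check pattern being suggested; the function signature and the way models and optimizers are collected are assumptions for this example, not the actual accelerate implementation:

import torch

def _prepare_fsdp(*args):
    # Hypothetical collection of models and optimizers from the prepare() arguments.
    models = {i: obj for i, obj in enumerate(args) if isinstance(obj, torch.nn.Module)}
    optimizers = {i: obj for i, obj in enumerate(args) if isinstance(obj, torch.optim.Optimizer)}

    # All cheap validation happens right away, before any wrapping or re-initialization.
    if not models and not optimizers:
        return args
    if len(models) != len(optimizers):
        raise ValueError(
            f"The number of models ({len(models)}) must be equal to the number of optimizers ({len(optimizers)})."
        )

    # ... the actual FSDP wrapping and optimizer handling would follow here.
    return args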

# Clear parameter lists.
for opt in optimizers.values():
    for group in opt.param_groups:
        group["params"] = []
BenjaminBossan (Member):

Would it be better to call group["params"].clear()? That clears the list in place, so any other references to the same list would be affected too.
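
For illustration, a tiny standalone example of the difference (the names are made up): rebinding group["params"] to a new list leaves other references to the old list untouched, while .clear() empties the existing list in place, so every reference to it sees the change.

# Rebinding: the alias keeps the old contents.
group = {"params": [1, 2, 3]}
alias = group["params"]
group["params"] = []
print(alias)  # [1, 2, 3]

# In-place clear: the alias is emptied too.
group = {"params": [1, 2, 3]}
alias = group["params"]
group["params"].clear()
print(alias)  # []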

eljandoubi (Contributor, Author)

@BenjaminBossan

  • You are right; the solution needs more thorough testing. I have conducted a small test in a multi-GPU environment, but it should be tested against other configurations and models.
  • Raising an error when the optimizer is initialized prematurely could be a solution, but two cases emerge:
    1. If Accelerate initializes the optimizer internally, either it creates a default optimizer with limited flexibility, or the user provides optimizer specifications. This would require external special treatment for FSDP, causing Accelerate to lose its unified interface for distributed training.
    2. If the user initializes the optimizer post-FSDP setup, Accelerate also loses generalization in this case.
  • For PEFT training, which involves mixing frozen and non-frozen parameters, use_orig_params=True must be used. This keeps the original, non-flattened parameters, meaning no changes are needed in the optimizer (I assume).
  • The Transformers Trainer internally uses delay_optimizer_creation and creates the optimizer after FSDP wrapping.

BenjaminBossan (Member) left a comment

Thanks for the updates and explaining your testing and reasoning further.

Raising an error when the optimizer is initialized prematurely could be a solution, but two cases emerge

These are valid concerns. At the end of the day, it's a trade-off and we need to decide which cost we'd rather pay.

The Transformers Trainer internally uses delay_optimizer_creation and creates the optimizer after FSDP wrapping.

I wonder if this logic could be moved/copied to accelerate.

For PEFT training, which involves mixing frozen and non-frozen parameters, use_orig_params=True must be used

I don't think this is strictly necessary; we have examples in PEFT with use_orig_params=False. But I have to admit I don't know exactly what changes under the hood in FSDP when this parameter is set. Note also that in the linked PEFT issue, I tried setting use_orig_params=True and it didn't help.


eljandoubi (Contributor, Author) commented Nov 26, 2024

  • The PyTorch FSDP documentation mentions that mixed frozen and non-frozen parameters are only supported when use_orig_params=True. I believe this is due to the flattening process — we cannot flatten parameters with gradients together with those without. My contribution does not handle the case where use_orig_params=True because, in that scenario, the parameters remain unchanged.
  • Another solution would be to adopt an approach similar to the DeepSpeed integration. This involves using a single model and optimizer: after wrapping the model with model = FSDP(model), we create the optimizer as optimizer = optimizer_class(model.parameters(), **defaults), as sketched below. See "Select the DeepSpeedCPUOptimizer based on the original optimizer class." #3255
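
A minimal sketch of that ordering in plain PyTorch; the model, optimizer class, and defaults below are placeholders, and the accelerate-side plumbing is omitted:

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torch.distributed (process group) has already been initialized.
model = torch.nn.Linear(8, 8)  # placeholder model
optimizer_class, defaults = torch.optim.AdamW, {"lr": 1e-4}

# Wrap first, then build the optimizer from the wrapped module's parameters,
# so it holds the flattened FSDP parameters rather than the original ones.
model = FSDP(model)
optimizer = optimizer_class(model.parameters(), **defaults)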

BenjaminBossan (Member) commented Nov 28, 2024

The PyTorch FSDP documentation mentions that mixed frozen and non-frozen parameters are only supported when use_orig_params=True. I believe this is due to the flattening process — we cannot flatten parameters with gradients together with those without.

This is true, but with the right FSDP auto wrap policy, trainable and frozen parameters should be prevented from being mixed.
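
For illustration, a hedged sketch of what such a policy could look like using lambda_auto_wrap_policy; the predicate below is an assumption made for this example, not the policy used by PEFT or accelerate:

import functools

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import lambda_auto_wrap_policy

def only_trainable_leaf(module):
    # Wrap leaf modules whose own parameters are all trainable into their own FSDP unit,
    # so trainable and frozen weights are not flattened together.
    params = list(module.parameters(recurse=False))
    return len(params) > 0 and all(p.requires_grad for p in params)

policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=only_trainable_leaf)

# The wrapping call is left commented because it needs a model and an initialized
# process group; with such a policy, no FSDP unit mixes frozen and trainable weights,
# which is why use_orig_params=False can also work.
# wrapped = FSDP(model, auto_wrap_policy=policy, use_orig_params=False)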

Another solution would be to adapt an approach similar to the DeepSpeed integration.

If this is an option here, it sounds like the more robust solution to me. Let's wait for @muellerzr to return to the office and get his opinion on this.

Development

Successfully merging this pull request may close these issues.

The optimizer is not receiving the FSDP model parameters.
3 participants