SWIFT supports full-parameter and LoRA fine-tuning of AnimateDiff, as well as inference.
First, you need to clone and install SWIFT:
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install ".[aigc]"
```
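After installation, you can optionally confirm that the package is importable. This is only a quick, hypothetical sanity check; the printed path depends on your environment:

```python
# Hypothetical sanity check: verify the installed swift package can be imported.
import swift

print(swift.__file__)  # prints where the package was installed
```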
Full-parameter fine-tuning can reproduce the effect of the officially released model animatediff-motion-adapter-v1-5-2 and requires a large number of short videos. The official reproduction used a 2.5M-sample subset of the WebVid dataset. The training effect is shown below:
Prompt: masterpiece, bestquality, highlydetailed, ultradetailed, girl, walking, on the street, flowers
Prompt: masterpiece, bestquality, highlydetailed, ultradetailed, beautiful house, mountain, snow top
Training on the 2.5M subset still yields somewhat unstable results; developers who use the 10M dataset can expect more stable generation.
```shell
# This file is in swift/examples/pytorch/animatediff/scripts/full
# Experimental environment: A100 * 4
# 200GB GPU memory in total
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
torchrun --nproc_per_node=4 animatediff_sft.py \
--model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
--csv_path /mnt/workspace/yzhao/tastelikefeet/webvid/results_2M_train.csv \
--video_folder /mnt/workspace/yzhao/tastelikefeet/webvid/videos2 \
--sft_type full \
--lr_scheduler_type constant \
--trainable_modules .*motion_modules.* \
--batch_size 4 \
--eval_steps 100 \
--gradient_accumulation_steps 16
```
We used 4 x A100 GPUs for training, which requires about 200GB of GPU memory in total and takes roughly 40 hours. The data format is as follows:
--csv_path # Pass in a CSV file in the following format:
name,contentUrl
Travel blogger shoot a story on top of mountains. young man holds camera in forest.,stock-footage-travel-blogger-shoot-a-story-on-top-of-mountains-young-man-holds-camera-in-forest.mp4
The name field is the prompt describing the short video, and contentUrl is the file name of the corresponding video (see the sketch below for assembling such a file).
--video_folder # Pass in a directory containing all the video files referenced by contentUrl in the CSV file.
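For reference, here is a minimal sketch of how such a CSV file could be assembled with pandas. The folder path, captions, and file names below are hypothetical placeholders; only the name/contentUrl column layout comes from the format above:

```python
# Hypothetical example: build the csv_path file for a folder of .mp4 clips.
from pathlib import Path

import pandas as pd

video_folder = Path("/path/to/videos")  # the directory later passed as --video_folder
captions = {                            # placeholder captions keyed by video file name
    "clip_0001.mp4": "a girl walking on the street, flowers",
    "clip_0002.mp4": "a beautiful house on a snowy mountain top",
}

rows = []
for video in sorted(video_folder.glob("*.mp4")):
    if video.name in captions:
        # name = prompt text, contentUrl = video file name, as described above
        rows.append({"name": captions[video.name], "contentUrl": video.name})

pd.DataFrame(rows, columns=["name", "contentUrl"]).to_csv("train.csv", index=False)
```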
To perform inference with the full-parameter model:
```shell
# This file is in swift/examples/pytorch/animatediff/scripts/full
# Experimental environment: A100
# 18GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python animatediff_infer.py \
--model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
--sft_type full \
--ckpt_dir /output/path/like/checkpoints/iter-xxx \
--eval_human true
```
The --ckpt_dir should point to the output folder produced by training. With --eval_human true, prompts are entered interactively at inference time; set it to false to read prompts from --validation_prompts_path instead.
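If you prefer non-interactive evaluation, pass --eval_human false and point --validation_prompts_path at a plain text file with one prompt per line (see the inference parameter list below). A hypothetical way to create such a file, reusing the prompts shown earlier:

```python
# Hypothetical helper: write a prompts file for --validation_prompts_path (one prompt per line).
prompts = [
    "masterpiece, bestquality, highlydetailed, ultradetailed, girl, walking, on the street, flowers",
    "masterpiece, bestquality, highlydetailed, ultradetailed, beautiful house, mountain, snow top",
]
with open("my_validation_prompts.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(prompts) + "\n")
```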
Full-parameter training trains the entire motion-adapter structure from scratch. Alternatively, users can fine-tune an existing motion adapter with LoRA using a small number of videos by running the following command:
```shell
# This file is in swift/examples/pytorch/animatediff/scripts/lora
# Experimental environment: A100
# 20GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python animatediff_sft.py \
--model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
--csv_path /mnt/workspace/yzhao/tastelikefeet/webvid/results_2M_train.csv \
--video_folder /mnt/workspace/yzhao/tastelikefeet/webvid/videos2 \
--motion_adapter_id_or_path Shanghai_AI_Laboratory/animatediff-motion-adapter-v1-5-2 \
--sft_type lora \
--lr_scheduler_type constant \
--trainable_modules .*motion_modules.* \
--batch_size 1 \
--eval_steps 200 \
--dataset_sample_size 10000 \
--gradient_accumulation_steps 16
```
Video data parameters are the same as above.
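The lora_rank, lora_alpha, and lora_dropout_p values in the parameter list below are the usual LoRA hyperparameters. Purely as an illustration of what they mean (this is not the code path SWIFT actually uses, and the target module names are hypothetical), an equivalent PEFT-style configuration would look roughly like this:

```python
# Illustrative sketch of the LoRA hyperparameters using the peft library.
# NOT SWIFT's actual implementation; target_modules below are hypothetical attention projections.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                # --lora_rank
    lora_alpha=32,      # --lora_alpha
    lora_dropout=0.05,  # --lora_dropout_p
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
```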
The inference command is as follows:
```shell
# This file is in swift/examples/pytorch/animatediff/scripts/lora
# Experimental environment: A100
# 18GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python animatediff_infer.py \
--model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
--motion_adapter_id_or_path Shanghai_AI_Laboratory/animatediff-motion-adapter-v1-5-2 \
--sft_type lora \
--ckpt_dir /output/path/like/checkpoints/iter-xxx \
--eval_human true
```
The --ckpt_dir should point to the output folder produced by training. To merge the LoRA weights into the MotionAdapter and save the merged model, add --merge_lora true (see the inference parameter list below).
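A merged or fully trained motion adapter can, in principle, also be loaded with the standard diffusers AnimateDiff pipeline. The following is only a rough sketch under that assumption; the checkpoint path is a placeholder, the checkpoint layout produced by SWIFT may differ, and the base model ID shown is the Hugging Face copy rather than the ModelScope one used in the scripts above:

```python
# Hypothetical sketch: load a trained/merged motion adapter with plain diffusers.
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("/output/path/like/checkpoints/iter-xxx")  # placeholder
pipe = AnimateDiffPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",  # assumed Hugging Face mirror of the SD base model
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")

result = pipe(
    prompt="masterpiece, bestquality, girl, walking, on the street, flowers",
    num_frames=16,
    num_inference_steps=25,
    guidance_scale=8.0,
)
export_to_gif(result.frames[0], "sample.gif")
```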
Below are the supported parameters and their meanings for training and inference, respectively.

Training parameters:

```python
motion_adapter_id_or_path: Optional[str] = None # The model ID or model path of the motion adapter. Specifying this parameter allows continued training from an existing official model.
motion_adapter_revision: Optional[str] = None # The model revision of the motion adapter, only useful when motion_adapter_id_or_path is the model ID.
model_id_or_path: str = None # The model ID or model path of the SD base model.
model_revision: str = None # The revision of the SD base model, only useful when model_id_or_path is the model ID.
dataset_sample_size: int = None # Number of training samples drawn from the dataset. The default (None) uses the full dataset.
sft_type: str = field(
    default='lora', metadata={'choices': ['lora', 'full']}) # Training method, supporting lora and full parameters.
output_dir: str = 'output' # Output folder.
ddp_backend: str = field(
    default='nccl', metadata={'choices': ['nccl', 'gloo', 'mpi', 'ccl']}) # The DDP backend to use when training with DDP.
seed: int = 42 # Random seed.
lora_rank: int = 8 # LoRA rank.
lora_alpha: int = 32 # LoRA alpha (scaling factor).
lora_dropout_p: float = 0.05 # LoRA dropout probability.
lora_dtype: str = 'fp32' # The dtype of the LoRA modules. If `AUTO`, it follows the dtype of the original module.
gradient_checkpointing: bool = False # Whether to enable gradient checkpointing; disabled by default. Note: the current version of diffusers has an issue and does not support setting this to True.
batch_size: int = 1 # Batch size.
num_train_epochs: int = 1 # Number of epochs.
# if max_steps >= 0, override num_train_epochs
learning_rate: Optional[float] = None # Learning rate.
weight_decay: float = 0.01 # Weight decay of the AdamW optimizer.
gradient_accumulation_steps: int = 16 # Number of gradient accumulation steps.
max_grad_norm: float = 1. # Maximum gradient norm (for gradient clipping).
lr_scheduler_type: str = 'cosine' # Type of lr_scheduler.
warmup_ratio: float = 0.05 # Proportion of warmup steps (0 means no warmup).
eval_steps: int = 50 # eval step interval.
save_steps: Optional[int] = None # save step interval.
dataloader_num_workers: int = 1 # Number of dataloader workers.
push_to_hub: bool = False # Whether to push the model to the ModelScope hub.
# 'user_name/repo_name' or 'repo_name'
hub_model_id: Optional[str] = None # Model ID on the ModelScope hub.
hub_private_repo: bool = False # Whether the hub repo is private.
push_hub_strategy: str = field( # Push strategy: push only the last checkpoint, or all checkpoints.
    default='push_best',
    metadata={'choices': ['push_last', 'all_checkpoints']})
# None: use env var `MODELSCOPE_API_TOKEN`
hub_token: Optional[str] = field( # ModelScope hub token.
    default=None,
    metadata={
        'help':
        'SDK token can be found in https://modelscope.cn/my/myaccesstoken'
    })
ignore_args_error: bool = False # True: notebook compatibility.
text_dropout_rate: float = 0.1 # Proportion of text prompts randomly dropped during training to improve model robustness.
validation_prompts_path: str = field( # The prompts file used during evaluation. By default, swift/aigc/configs/validation.txt is used.
    default=None,
    metadata={
        'help':
        'The validation prompts file path; if None, aigc/configs/validation.txt is used'
    })
trainable_modules: str = field( # Trainable modules; the default value is recommended.
    default='.*motion_modules.*',
    metadata={
        'help':
        'The trainable modules; by default, .*motion_modules.* will be trained'
    })
mixed_precision: bool = True # Whether to use mixed-precision training.
enable_xformers_memory_efficient_attention: bool = True # Whether to use xformers memory-efficient attention.
num_inference_steps: int = 25 # Number of denoising steps when generating validation videos.
guidance_scale: float = 8. # Classifier-free guidance scale.
sample_size: int = 256 # Resolution of the sampled video frames.
sample_stride: int = 4 # Frame sampling stride when reading training videos (every Nth frame is sampled).
sample_n_frames: int = 16 # Number of frames sampled per training clip.
csv_path: str = None # Path to the input dataset CSV file (see the data format above).
video_folder: str = None # Directory containing the input videos.
motion_num_attention_heads: int = 8 # motion adapter parameter.
motion_max_seq_length: int = 32 # motion adapter parameter.
num_train_timesteps: int = 1000 # Noise scheduler parameter.
beta_start: float = 0.00085 # Noise scheduler parameter.
beta_end: float = 0.012 # Noise scheduler parameter.
beta_schedule: str = 'linear' # Noise scheduler parameter.
steps_offset: int = 1 # Noise scheduler parameter.
clip_sample: bool = False # Noise scheduler parameter.
use_wandb: bool = False # Whether to use wandb.
```
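The trainable_modules value above is a regular expression matched against parameter names: only matching parameters are left trainable, and everything else is frozen. The exact matching logic inside SWIFT may differ; the idea is roughly the following sketch:

```python
# Rough sketch (not SWIFT's actual code): unfreeze only parameters whose names match
# the --trainable_modules regex, e.g. the default '.*motion_modules.*'.
import re

import torch.nn as nn


def apply_trainable_modules(model: nn.Module, pattern: str = ".*motion_modules.*") -> None:
    regex = re.compile(pattern)
    for name, param in model.named_parameters():
        # e.g. 'down_blocks.0.motion_modules.0.transformer_blocks.0.attn1.to_q.weight'
        param.requires_grad = bool(regex.match(name))
```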
Inference parameters:

```python
motion_adapter_id_or_path: Optional[str] = None # The model ID or model path of the motion adapter to load for inference (e.g. the official adapter that LoRA training started from).
motion_adapter_revision: Optional[str] = None # The model revision of the motion adapter, only useful when motion_adapter_id_or_path is the model ID.
model_id_or_path: str = None # The model ID or model path of the SD base model.
model_revision: str = None # The revision of the SD base model, only useful when model_id_or_path is the model ID.
sft_type: str = field(
    default='lora', metadata={'choices': ['lora', 'full']}) # Training method, supporting lora and full parameters.
ckpt_dir: Optional[str] = field(
    default=None, metadata={'help': '/path/to/your/vx-xxx/checkpoint-xxx'}) # The output folder of training.
eval_human: bool = False # Whether to use manual (interactive) prompt input for evaluation. If False, the prompts in validation_prompts_path are used.
seed: int = 42 # Random seed.
merge_lora: bool = False # Merge the LoRA weights into the MotionAdapter and save the merged model.
replace_if_exists: bool = False # Overwrite existing files if the merged output directory already exists when merge_lora is True.
# other
ignore_args_error: bool = False # True: notebook compatibility.
validation_prompts_path: str = None # The prompts file used for validation when eval_human=False; each line is one prompt.
output_path: str = './generated' # The output directory for generated GIFs.
enable_xformers_memory_efficient_attention: bool = True # Whether to use xformers memory-efficient attention.
num_inference_steps: int = 25 # Number of denoising steps.
guidance_scale: float = 8. # Classifier-free guidance scale.
sample_size: int = 256 # Resolution of the generated video frames.
sample_stride: int = 4 # Frame sampling stride (used when sampling frames from videos).
sample_n_frames: int = 16 # Number of frames per generated video.
motion_num_attention_heads: int = 8 # motion adapter parameter.
motion_max_seq_length: int = 32 # motion adapter parameter.
num_train_timesteps: int = 1000 # Noise scheduler parameter.
beta_start: float = 0.00085 # Noise scheduler parameter.
beta_end: float = 0.012 # Noise scheduler parameter.
beta_schedule: str = 'linear' # Noise scheduler parameter.
steps_offset: int = 1 # Noise scheduler parameter.
clip_sample: bool = False # Noise scheduler parameter.
```
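The scheduler-related defaults (num_train_timesteps, beta_start, beta_end, beta_schedule, steps_offset, clip_sample) mirror the constructor arguments of a diffusers noise scheduler. As a rough illustration only, and assuming a DDIM-style scheduler (which scheduler class SWIFT actually instantiates is not specified here), they map as follows:

```python
# Illustrative sketch: the scheduler defaults above expressed as a diffusers DDIMScheduler.
# The choice of DDIMScheduler is an assumption, not taken from the SWIFT source.
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="linear",
    steps_offset=1,
    clip_sample=False,
)
```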