## Example FSDP config
Below is an example YAML config for BF16 mixed-precision training using PyTorch Fully Sharded Data Parallelism (FSDP) with CPU offloading on 8 GPUs; lines prefixed with `+` highlight the FSDP-specific settings.
```diff
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
+distributed_type: FSDP
downcast_bf16: 'no'
dynamo_backend: 'NO'
+fsdp_config:
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_backward_prefetch_policy: BACKWARD_PRE
+  fsdp_offload_params: true
+  fsdp_sharding_strategy: 1
+  fsdp_state_dict_type: FULL_STATE_DICT
+  fsdp_transformer_layer_cls_to_wrap: T5Block
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
+num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
```
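The same settings can also be supplied in code through Accelerate's `FullyShardedDataParallelPlugin` and passed to the `Accelerator`. The snippet below is a minimal sketch; the state-dict options shown are illustrative assumptions, not required values, and you still start the script with `accelerate launch` (or `torchrun`) so the distributed processes get spawned.

```python
# Minimal sketch: configuring FSDP via FullyShardedDataParallelPlugin instead of
# (or in addition to) the YAML file. The state-dict options below are
# illustrative choices, not required values.
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import (
    FullOptimStateDictConfig,
    FullStateDictConfig,
)

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=True),
)

accelerator = Accelerator(mixed_precision="bf16", fsdp_plugin=fsdp_plugin)
```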
## Required code changes
```diff
from accelerate import Accelerator

def main():
    accelerator = Accelerator()

-   model, optimizer, dataloader, scheduler = accelerator.prepare(
-       model, optimizer, dataloader, scheduler
-   )
+   model = accelerator.prepare(model)
+   # Optimizer can be any PyTorch optimizer class
+   optimizer = torch.optim.AdamW(params=model.parameters(), lr=lr)
+   optimizer, dataloader, scheduler = accelerator.prepare(
+       optimizer, dataloader, scheduler
+   )

    ...

    accelerator.unwrap_model(model).save_pretrained(
        args.output_dir,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
+       state_dict=accelerator.get_state_dict(model)
    )

    ...
```
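Putting the pieces of the diff together, a complete (if toy) script might look like the sketch below. The model name, the dummy batch, and the hyperparameters are placeholders chosen only for illustration; the parts that matter for FSDP are the prepare order and the `state_dict` argument when saving.

```python
# Toy end-to-end sketch of the FSDP-ready training flow shown in the diff above.
# "t5-small", the dummy data, and the hyperparameters are illustrative only.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def main():
    accelerator = Accelerator()

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    # Prepare the model on its own first: FSDP shards the parameters in place,
    # so the optimizer must be built from the already-sharded parameters.
    model = accelerator.prepare(model)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Tiny dummy dataset so the example is self-contained.
    inputs = tokenizer(["translate English to German: Hello world."] * 16,
                       return_tensors="pt", padding=True)
    labels = tokenizer(["Hallo Welt."] * 16, return_tensors="pt", padding=True).input_ids
    dataset = TensorDataset(inputs.input_ids, inputs.attention_mask, labels)
    dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

    scheduler = torch.optim.lr_scheduler.LinearLR(optimizer)

    optimizer, dataloader, scheduler = accelerator.prepare(optimizer, dataloader, scheduler)

    model.train()
    for input_ids, attention_mask, target_ids in dataloader:
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=target_ids)
        accelerator.backward(outputs.loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

    # Gather the full, unsharded state dict so the checkpoint can be reloaded
    # outside of FSDP.
    accelerator.unwrap_model(model).save_pretrained(
        "fsdp-t5-output",
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
        state_dict=accelerator.get_state_dict(model),
    )


if __name__ == "__main__":
    main()
```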
## Launching the training
If the YAML was generated through the `accelerate config` command:

```bash
accelerate launch {script_name.py} {--arg1} {--arg2} ...
```

If the YAML is saved to a `~/config.yaml` file:

```bash
accelerate launch --config_file ~/config.yaml {script_name.py} {--arg1} {--arg2} ...
```
Or you can pass the configuration parameters directly to `accelerate launch` and skip the `config.yaml` file entirely:

```bash
accelerate launch \
    --use_fsdp \
    --num_processes=8 \
    --mixed_precision=bf16 \
    --fsdp_sharding_strategy=1 \
    --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \
    --fsdp_transformer_layer_cls_to_wrap=T5Block \
    --fsdp_offload_params=true \
    {script_name.py} {--arg1} {--arg2} ...
```
## Caveats
For PyTorch FSDP, you need to prepare the model **before** preparing the optimizer: FSDP shards the parameters in place, which breaks any optimizer that was already initialized with references to the unsharded parameters.
For transformer models, use the `TRANSFORMER_BASED_WRAP` auto-wrap policy, as shown in the config above.
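For context, `TRANSFORMER_BASED_WRAP` corresponds roughly to PyTorch's `transformer_auto_wrap_policy`, which turns each transformer block (here `T5Block`) into its own FSDP unit. The raw-PyTorch sketch below only illustrates the idea; Accelerate does this wiring for you from the YAML, and its internals may differ.

```python
# Raw-PyTorch sketch of what a transformer-based auto-wrap policy does.
# Accelerate sets this up from the YAML config; the wiring here is illustrative.
# Requires an initialized distributed process group (e.g. run under
# accelerate launch or torchrun).
import functools

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForSeq2SeqLM
from transformers.models.t5.modeling_t5 import T5Block

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={T5Block},  # mirrors fsdp_transformer_layer_cls_to_wrap
)

# Each T5Block becomes its own FSDP unit, so full parameters for a block are
# gathered only while that block is executing, keeping peak memory low.
fsdp_model = FSDP(model, auto_wrap_policy=auto_wrap_policy)
```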
## Further reading
To learn more, check out the related documentation:
- <a href="https://huggingface.co/docs/accelerate/usage_guides/fsdp" target="_blank">How to use Fully Sharded Data Parallelism</a>
- <a href="https://huggingface.co/blog/pytorch-fsdp" target="_blank">Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel</a>