##
Below are example YAML configs for multi-GPU training with 8 GPUs across two machines (nodes), where each machine has four GPUs:
On machine 1 (host):
<pre>
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: 192.168.20.1
main_process_port: 8080
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 2
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
</pre>
On machine 2:
<pre>
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 1
main_process_ip: 192.168.20.1
main_process_port: 8080
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 2
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
</pre>
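The two YAML files are identical except for `machine_rank`. A quick Python sketch makes that explicit; the dicts below reproduce only the multi-node-relevant keys, not the full files, and the check itself is illustrative rather than part of accelerate:

```python
# Only the keys relevant to multi-node coordination are reproduced here.
host = {
    "machine_rank": 0,
    "main_process_ip": "192.168.20.1",
    "main_process_port": 8080,
    "num_machines": 2,
    "num_processes": 8,
}
client = dict(host, machine_rank=1)  # machine 2 differs only in its rank

differing_keys = {k for k in host if host[k] != client[k]}
print(differing_keys)  # {'machine_rank'}
```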
To launch a script, on each machine run one of the following variations:
If the YAML was generated through the `accelerate config` command:
```
accelerate launch {script_name.py} {--arg1} {--arg2} ...
```
If the YAML is saved to a `~/config.yaml` file:
```
accelerate launch --config_file ~/config.yaml {script_name.py} {--arg1} {--arg2} ...
```
Or you can use `accelerate launch` with the right configuration parameters and no `config.yaml` file at all.
Replace `{node_number}` with the appropriate machine rank (0 for the host, 1 and up for the others):
```
accelerate launch --multi_gpu --num_machines=2 --num_processes=8 --main_process_ip="192.168.20.1" --main_process_port=8080 \
--machine_rank={node_number} {script_name.py} {--arg1} {--arg2} ...
```
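Since the only per-machine difference in that command is `--machine_rank`, it can also be generated programmatically. A small sketch, where the script name `train.py` is a placeholder for your own script and arguments:

```python
def launch_command(node_number: int) -> str:
    """Build the no-config-file launch command for one machine.

    `train.py` is a placeholder; in practice append your own script and args.
    """
    return (
        "accelerate launch --multi_gpu "
        "--num_machines=2 --num_processes=8 "
        '--main_process_ip="192.168.20.1" --main_process_port=8080 '
        f"--machine_rank={node_number} train.py"
    )

print(launch_command(0))  # run this on the host
print(launch_command(1))  # run this on machine 2
```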
##
When utilizing multiple machines (nodes) for training, the config file needs to know how each machine can communicate with the others (the IP address and port of the main process), how many *total* GPUs there are, and whether the current machine is the host or a client.
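To make the "total" point concrete: `num_processes` counts GPUs across every machine, not per machine. A small illustrative check (not part of accelerate):

```python
def total_processes(num_machines: int, gpus_per_machine: int) -> int:
    # num_processes in the config must equal this total, not the per-machine count
    return num_machines * gpus_per_machine

# Matches the example configs above: 2 machines with 4 GPUs each
assert total_processes(2, 4) == 8
```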
**Remember that you can always use the `accelerate launch` functionality, even if the code in your script does not use the `Accelerator`.**
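For instance, `accelerate launch` (which uses `torchrun` under the hood for multi-GPU) exports the standard PyTorch distributed environment variables, so even a script that never imports `Accelerator` can see where it is running. A minimal sketch, with fallbacks so it also runs outside a launcher:

```python
import os

def describe_process():
    """Read the rank info that the launcher exports (defaults cover a plain run)."""
    rank = int(os.environ.get("RANK", 0))              # global rank: 0-7 in the setup above
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # rank within the current machine
    world_size = int(os.environ.get("WORLD_SIZE", 1))  # total processes across machines
    return rank, local_rank, world_size

if __name__ == "__main__":
    rank, local_rank, world_size = describe_process()
    print(f"global rank {rank} (local rank {local_rank}) of {world_size} processes")
```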
##
To learn more, check out the related documentation:
- <a href="https://huggingface.co/docs/accelerate/main/en/basic_tutorials/launch" target="_blank">Launching distributed code</a>
- <a href="https://huggingface.co/docs/accelerate/main/en/package_reference/cli" target="_blank">The Command Line</a>