std::bad_alloc / DataLoader worker exited unexpectedly during training with smolvla_base
I'm encountering a std::bad_alloc error when training with lerobot-train. The DataLoader worker exits with the following message:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
RuntimeError: DataLoader worker (pid(s) 2631794) exited unexpectedly
Here is the command I used for training:
lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/svla_so101_pickplace \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/my_smolvla_3 \
  --job_name=my_smolvla_training \
  --policy.device=cuda \
  --wandb.enable=true
Environment:
Python 3.12
PyTorch 2.9 nightly + CUDA 12.9
Torchvision and Torchaudio dev versions matching PyTorch nightly
lerobot_venv virtual environment
GPU: 2x NVIDIA RTX PRO 6000 Blackwell (96GB each)
System RAM: 512GB
Shared memory for the container set to 32GB
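Since DataLoader workers hand tensors between processes through /dev/shm, here is a quick way to confirm how much shared memory the workers actually see inside the container (a sketch; assumes Linux with /dev/shm mounted):

# Sanity check: report the /dev/shm capacity visible inside the container.
# An exhausted /dev/shm can surface as std::bad_alloc in a DataLoader worker.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**30:.1f} GiB, used={used / 2**30:.1f} GiB, free={free / 2**30:.1f} GiB")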
Why I'm Using This Setup: my GPUs are relatively new and require sm_120 support, so PyTorch 2.9 nightly with CUDA 12.9 is needed to fully use the hardware. The environment itself has been validated: I was able to train the official diffusion_pusht policy in this same setup without errors.
Attempted Solutions:
Setting num_workers=1 still results in the same error.
Setting num_workers=0 and commenting out line 256 in lerobot_train.py (i.e., removing prefetch_factor=2) lets training start successfully, but it runs much slower.
The relevant code snippet (lerobot_train.py, around line 256):
dataloader = torch.utils.data.DataLoader(
    dataset,
    num_workers=cfg.num_workers,
    batch_size=cfg.batch_size,
    shuffle=shuffle and not cfg.dataset.streaming,
    sampler=sampler,
    pin_memory=device.type == "cuda",
    drop_last=False,
    # prefetch_factor=2,  # commented out: PyTorch only accepts prefetch_factor when num_workers > 0
)
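As an alternative to removing the line entirely, a hedged sketch of the same call (reusing the cfg, dataset, sampler, shuffle, and device names from the snippet above) could pass prefetch_factor only when workers are enabled, which PyTorch requires anyway, and lower it to shrink each worker's host-memory footprint:

dataloader = torch.utils.data.DataLoader(
    dataset,
    num_workers=cfg.num_workers,
    batch_size=cfg.batch_size,
    shuffle=shuffle and not cfg.dataset.streaming,
    sampler=sampler,
    pin_memory=device.type == "cuda",
    drop_last=False,
    persistent_workers=cfg.num_workers > 0,  # reuse workers instead of respawning them
    prefetch_factor=1 if cfg.num_workers > 0 else None,  # must stay None when num_workers == 0
)

With batch_size=64 and image-heavy episodes, each prefetched batch is large, so dropping prefetch_factor from 2 to 1 should roughly halve the per-worker staging memory.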
Questions:
Is this issue caused by a memory allocation failure in the DataLoader worker?
Should I adjust num_workers further, or tune other DataLoader settings, to reduce memory usage?
Could the issue be related to memory fragmentation or CUDA memory management?
Are there recommended DataLoader (or other) settings to keep the worker from exiting unexpectedly on my server setup?
How can I improve training speed if I have to disable prefetch_factor to avoid this error? (One idea I'm considering is sketched below.)
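For that last question, one thing I'm considering (a sketch, not yet tested; batch, policy, and train_step are placeholders, not the exact lerobot training loop) is keeping pin_memory on and moving batches to the GPU with non_blocking=True, so host-to-device copies overlap with compute even without prefetching workers:

for batch in dataloader:
    # Pinned host memory + non_blocking=True lets the H2D copy overlap with GPU compute.
    batch = {
        k: v.to(device, non_blocking=True) if isinstance(v, torch.Tensor) else v
        for k, v in batch.items()
    }
    loss = train_step(policy, batch)  # placeholder for the actual lerobot training step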