AI & ML interests

None defined yet.

nouamanetazi
posted an update 3 days ago
After training SmolLM3 on 384 H100s for nearly a month, I've come to realize something most people overlook: infrastructure is the make-or-break factor in LLM training. 🔄

Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious NCCL errors, or when your expensive GPU cluster is running at 60% efficiency, the problem isn't your model. It's most probably how you're using the hardware. 🛠️
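(For a rough sense of what "efficiency" means here: one common proxy is Model FLOPs Utilization, MFU. The post doesn't say how it was measured, so the sketch below just uses the standard 6 · params · tokens approximation and an assumed H100 BF16 peak; the throughput numbers in the example are made up.)

```python
# Back-of-the-envelope Model FLOPs Utilization (MFU) -- a sketch, not how the run
# above was measured. Assumes the common 6 * params * tokens training-FLOPs
# approximation and an H100 dense BF16 peak of ~989 TFLOPS.

H100_BF16_PEAK_FLOPS = 989e12  # per GPU, dense (no sparsity)

def mfu(params: float, tokens_per_sec: float, num_gpus: int,
        peak_flops_per_gpu: float = H100_BF16_PEAK_FLOPS) -> float:
    """Fraction of theoretical peak compute the training run actually achieves."""
    achieved_flops = 6 * params * tokens_per_sec  # forward + backward FLOPs per second
    return achieved_flops / (num_gpus * peak_flops_per_gpu)

# Made-up example: a 3B-parameter model at 6M tokens/sec on 384 GPUs -> ~28% MFU
print(f"MFU ~ {mfu(params=3e9, tokens_per_sec=6e6, num_gpus=384):.0%}")
```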

Questions that seemed simple but had no clear answers: Why is MoE training slower than dense models? Which NCCL flags should we actually set? How often should we checkpoint without killing throughput?
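(On the NCCL-flags question, here is a minimal Python sketch of how such flags are typically set through environment variables before the process group is created. The specific values, interface name, and HCA name are illustrative assumptions, not recommendations from the playbook.)

```python
import os

# Illustrative NCCL environment settings -- the right flags and values depend on your
# fabric (InfiniBand vs. RoCE vs. Ethernet); these are assumptions, not the playbook's
# recommendations.
nccl_env = {
    "NCCL_DEBUG": "WARN",            # raise to INFO when chasing those 2 AM failures
    "NCCL_SOCKET_IFNAME": "eth0",    # hypothetical interface name; match your NICs
    "NCCL_IB_HCA": "mlx5",           # restrict NCCL to your InfiniBand HCAs
    "NCCL_NET_GDR_LEVEL": "PHB",     # how close GPU and NIC must be for GPUDirect RDMA
}
os.environ.update(nccl_env)

# These only take effect if set before the first NCCL communicator is created,
# e.g. before torch.distributed.init_process_group("nccl").
```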

That's why we built The Smol Training Playbook 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the infrastructure layer that most teams get wrong.

We validated real vs. theoretical bandwidth across the entire stack: HBM3 hitting 3 TB/s, NVLink 4.0 reaching 786 GB/s, PCIe Gen4 at 24.2 GB/s. Then we ran collective operations across 128 GPUs (16 nodes, 8x H100s each) and measured how performance degrades at scale: all-reduce drops from 480 GB/s on a single node to 320-350 GB/s across 16 nodes.
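(For reference, a minimal torch.distributed sketch of how an all-reduce bandwidth figure like the ones above can be measured; the 2·(n−1)/n scaling is the nccl-tests "bus bandwidth" convention, and the payload size, iteration counts, and launch command are assumptions, not the setup behind these measurements.)

```python
import os
import time

import torch
import torch.distributed as dist

# Minimal all-reduce bandwidth probe, in the spirit of nccl-tests (not the harness
# used for the numbers above). Launch with torchrun, e.g.:
#   torchrun --nnodes 16 --nproc_per_node 8 allreduce_bench.py   # hypothetical filename

dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

size_bytes = 1 << 30                                    # 1 GiB payload (arbitrary choice)
x = torch.empty(size_bytes // 2, dtype=torch.bfloat16, device="cuda")

for _ in range(5):                                      # warm-up to build communicators
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / iters

# nccl-tests "bus bandwidth": algorithm bandwidth scaled by 2 * (n - 1) / n for all-reduce.
algo_bw = size_bytes / elapsed
bus_bw = algo_bw * 2 * (world - 1) / world
if rank == 0:
    print(f"all-reduce bus bandwidth: {bus_bw / 1e9:.0f} GB/s across {world} GPUs")

dist.destroy_process_group()
```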

If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.

š“š”šž š’š¦šØš„ š“š«ššš¢š§š¢š§š  šš„ššš²š›šØšØš¤: https://lnkd.in/e5MKXUHS

Shared with ❤️ by the HuggingFace team
nouamanetazi
updated a Space 6 months ago
nouamanetazi
published a Space 6 months ago
nouamanetazi
posted an update over 1 year ago