llm-conf

muellerzr committed on Dec 8, 2022

Commit

c43c604

1 Parent(s): b0d1496

Add notebook

Files changed (1)

Accelerate.ipynb +612 -0

Accelerate.ipynb ADDED

	@@ -0,0 +1,612 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ff5c7a97-02d5-4aea-8bd5-59be5e62bf01",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "title: \"Accelerate, Three Powerful Sublibraries for PyTorch\"\n",
+    "author: \"Zachary Mueller\"\n",
+    "format: \n",
+    "    revealjs:\n",
+    "        theme: moon\n",
+    "        fig-format: png\n",
+    "---"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "f2333422",
+   "metadata": {},
+   "source": [
+    "## Test Gradio {background-iframe=\"https://muellerzr-accelerate-presentation.hf.space\"}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "45e61402-f734-4500-8eb6-fcdd6f17a0d4",
+   "metadata": {},
+   "source": [
+    "## Who am I?\n",
+    "\n",
+    "- Zachary Mueller\n",
+    "- Deep Learning Software Engineer at 🤗\n",
+    "- API design geek"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8f9864d2-5787-4af3-a08d-b372e5851a0f",
+   "metadata": {},
+   "source": [
+    "## What is 🤗 Accelerate?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "166b148a-e2f0-46b0-bc61-ac6e81da5ac5",
+   "metadata": {},
+   "source": [
+    "```{mermaid}\n",
+    "%%| fig-height: 6\n",
+    "graph LR\n",
+    "    A{\"🤗 Accelerate#32;\"}\n",
+    "    A --> B[\"Launching<br>Interface#32;\"]\n",
+    "    A --> C[\"Training Library#32;\"]\n",
+    "    A --> D[\"Big Model<br>Inference#32;\"]\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "84d6fd12-18cd-4448-9123-821133673b95",
+   "metadata": {},
+   "source": [
+    "# A Launching Interface\n",
+    "\n",
+    "Can't I just use `python do_the_thing.py`?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e5488645-daa3-4353-be9f-7af765a52666",
+   "metadata": {},
+   "source": [
+    "## A Launching Interface\n",
+    "\n",
+    "Launching scripts in different environments is complicated:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ce856633-1909-4f18-9610-e934194dd584",
+   "metadata": {},
+   "source": [
+    "- ```bash \n",
+    "python script.py\n",
+    "```\n",
+    "\n",
+    "- ```bash \n",
+    "torchrun --nnodes=1 --nproc_per_node=2 script.py\n",
+    "```\n",
+    "\n",
+    "- ```bash \n",
+    "deepspeed --num_gpus=2 script.py\n",
+    "```\n",
+    "\n",
+    "And more!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4e6414d0-f8f8-4bd2-b06f-fe7f848320f1",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## A Launching Interface\n",
+    "\n",
+    "But it doesn't have to be:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5dfd30c0-7240-4a13-9b51-061c4762b37e",
+   "metadata": {},
+   "source": [
+    "```bash\n",
+    "accelerate launch script.py\n",
+    "```\n",
+    "\n",
+    "A single command to launch with `DeepSpeed`, Fully Sharded Data Parallelism, across single and multi CPUs and GPUs, and to train on TPUs[^1] too! \n",
+    "\n",
+    "[^1]: Without needing to modify your code and create a `_mp_fn`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c0760c9a-4307-4143-9adc-bf1ce2ed4460",
+   "metadata": {},
+   "source": [
+    "## A Launching Interface\n",
+    "\n",
+    "Generate a device-specific configuration through `accelerate config`\n",
+    "\n",
+    "![](CLI.gif)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b0f1dc7a-ec43-48ba-b0a0-1331981733d0",
+   "metadata": {},
+   "source": [
+    "## A Launching Interface\n",
+    "\n",
+    "Or don't. `accelerate config` doesn't *have* to be done!\n",
+    "\n",
+    "```bash\n",
+    "torchrun --nnodes=1 --nproc_per_node=2 script.py\n",
+    "accelerate launch --multi_gpu --nproc_per_node=2 script.py\n",
+    "```\n",
+    "\n",
+    "A quick default configuration can be made too:\n",
+    "\n",
+    "```bash \n",
+    "accelerate config default\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ff8d2c3d-5a08-4e5b-9896-1a0bcb77b5a6",
+   "metadata": {},
+   "source": [
+    "## A Launching Interface"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a395af44-96f8-4f3a-ac47-3f65a6062d24",
+   "metadata": {},
+   "source": [
+    "With the `notebook_launcher` it's also possible to launch code directly from your Jupyter environment!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "99b14b46-6be5-4ef4-a3ee-82876b1d7802",
+   "metadata": {},
+   "source": [
+    "```python\n",
+    "from accelerate import notebook_launcher\n",
+    "notebook_launcher(\n",
+    "    training_loop_function, \n",
+    "    args, \n",
+    "    num_processes=2\n",
+    ")\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a50e27a7-4235-4695-bf99-59c0f3d0e451",
+   "metadata": {},
+   "source": [
+    "```python\n",
+    "Launching training on 2 GPUs.\n",
+    "epoch 0: 88.12\n",
+    "epoch 1: 91.73\n",
+    "epoch 2: 92.58\n",
+    "epoch 3: 93.90\n",
+    "epoch 4: 94.71\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2db4e66d-d8b0-4f3f-9236-e86c1c3ea5d2",
+   "metadata": {},
+   "source": [
+    "# A Training Library\n",
+    "\n",
+    "Okay, will `accelerate launch` make `do_the_thing.py` use all my GPUs magically?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1cd093ef-d3ce-4ea4-89a1-be145fbe5cc0",
+   "metadata": {},
+   "source": [
+    "## A Training Library\n",
+    "\n",
+    "- We just showed that it's possible to use `accelerate launch` to *launch* a Python script in various distributed environments\n",
+    "- This does *not* mean that the script will automatically make use of the new compute, or run on it efficiently.\n",
+    "- Training on different computes often means changing *many* lines of code for each specific compute.\n",
+    "- 🤗 `accelerate` solves this by ensuring the same code can be run on a CPU or GPU, multiples of either, and on TPUs!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c0b12eb9-feeb-4040-a784-8e78966165be",
+   "metadata": {},
+   "source": [
+    "## A Training Library\n",
+    "\n",
+    "\n",
+    "```{.python}\n",
+    "for batch in dataloader:\n",
+    "    optimizer.zero_grad()\n",
+    "    inputs, targets = batch\n",
+    "    inputs = inputs.to(device)\n",
+    "    targets = targets.to(device)\n",
+    "    outputs = model(inputs)\n",
+    "    loss = loss_function(outputs, targets)\n",
+    "    loss.backward()\n",
+    "    optimizer.step()\n",
+    "    scheduler.step()\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bbb72602-f86f-42f6-ab44-05fbd0dfcecd",
+   "metadata": {},
+   "source": [
+    "## A Training Library {.smaller}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b5f90b84-fff5-4c14-bde7-d1efbcc37781",
+   "metadata": {},
+   "source": [
+    ":::: {.columns}\n",
+    "::: {.column width=\"43%\"}\n",
+    "<br><br><br>\n",
+    "```{.python code-line-numbers=\"5-6,9\"}\n",
+    "# For alignment purposes\n",
+    "for batch in dataloader:\n",
+    "    optimizer.zero_grad()\n",
+    "    inputs, targets = batch\n",
+    "    inputs = inputs.to(device)\n",
+    "    targets = targets.to(device)\n",
+    "    outputs = model(inputs)\n",
+    "    loss = loss_function(outputs, targets)\n",
+    "    loss.backward()\n",
+    "    optimizer.step()\n",
+    "    scheduler.step()\n",
+    "```\n",
+    ":::\n",
+    "::: {.column width=\"57%\"}\n",
+    "```{.python code-line-numbers=\"1-7,12-13,16\"}\n",
+    "from accelerate import Accelerator\n",
+    "accelerator = Accelerator()\n",
+    "dataloader, model, optimizer, scheduler = (\n",
+    "    accelerator.prepare(\n",
+    "        dataloader, model, optimizer, scheduler\n",
+    "    )\n",
+    ")\n",
+    "\n",
+    "for batch in dataloader:\n",
+    "    optimizer.zero_grad()\n",
+    "    inputs, targets = batch\n",
+    "    # inputs = inputs.to(device)\n",
+    "    # targets = targets.to(device)\n",
+    "    outputs = model(inputs)\n",
+    "    loss = loss_function(outputs, targets)\n",
+    "    accelerator.backward(loss) # loss.backward()\n",
+    "    optimizer.step()\n",
+    "    scheduler.step()\n",
+    "```\n",
+    ":::\n",
+    "\n",
+    "::::"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "60c90913-2542-4b1d-8121-b2228c8a2ef7",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## A Training Library\n",
+    "\n",
+    "What all happened in `Accelerator.prepare`?\n",
+    "\n",
+    "::: {.incremental}\n",
+    "1. `Accelerator` looked at the configuration\n",
+    "2. The `dataloader` was converted into one that can dispatch each batch onto a separate GPU\n",
+    "3. The `model` was wrapped with the appropriate DDP wrapper from either `torch.distributed` or `torch_xla`\n",
+    "4. The `optimizer` and `scheduler` were converted into an `AcceleratedOptimizer` and an `AcceleratedScheduler`, which know how to handle any distributed scenario\n",
+    ":::"
+   ]
+  },
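+  {
+   "cell_type": "markdown",
+   "id": "3d7a1c52-9b4e-4f60-8a21-5e0c6b7d9f13",
+   "metadata": {},
+   "source": [
+    "## A Training Library, Sharding Batches {.smaller}\n",
+    "\n",
+    "One rough way to picture the dataloader step is the plain-Python sketch below. It is not the actual 🤗 Accelerate implementation: the helper `batches_for_process` is hypothetical, it assumes each process simply takes every `num_processes`-th batch, and it ignores details such as uneven final batches.\n",
+    "\n",
+    "```python\n",
+    "def batches_for_process(batches, process_index, num_processes):\n",
+    "    # Hypothetical helper: keep the batches whose position maps to this process\n",
+    "    return [\n",
+    "        batch\n",
+    "        for i, batch in enumerate(batches)\n",
+    "        if i % num_processes == process_index\n",
+    "    ]\n",
+    "\n",
+    "\n",
+    "batches = [[0, 1], [2, 3], [4, 5], [6, 7]]\n",
+    "first_gpu = batches_for_process(batches, 0, 2)   # [[0, 1], [4, 5]]\n",
+    "second_gpu = batches_for_process(batches, 1, 2)  # [[2, 3], [6, 7]]\n",
+    "```"
+   ]
+  },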
+  {
+   "cell_type": "markdown",
+   "id": "59400a16-bce7-4a0a-8548-effd3c4c6cae",
+   "metadata": {},
+   "source": [
+    "## A Training Library, Mixed Precision\n",
+    "\n",
+    "🤗 `accelerate` also supports *automatic mixed precision*. \n",
+    "\n",
+    "By passing a single flag to the `Accelerator` object, the mixed precision of your choosing (such as `bf16` or `fp16`) will be applied when calling `accelerator.backward()`:\n",
+    "\n",
+    "```{.python code-line-numbers=\"2,9\"}\n",
+    "from accelerate import Accelerator\n",
+    "accelerator = Accelerator(mixed_precision=\"fp16\")\n",
+    "...\n",
+    "for batch in dataloader:\n",
+    "    optimizer.zero_grad()\n",
+    "    inputs, targets = batch\n",
+    "    outputs = model(inputs)\n",
+    "    loss = loss_function(outputs, targets)\n",
+    "    accelerator.backward(loss)\n",
+    "    optimizer.step()\n",
+    "    scheduler.step()\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fde7ae10-4fbd-4e25-8f5d-9d47c849966d",
+   "metadata": {},
+   "source": [
+    "## A Training Library, Gradient Accumulation\n",
+    "\n",
+    "Gradient accumulation in distributed setups often needs extra care to ensure gradients are aligned when they need to be and the backward pass is computationally efficient.\n",
+    "\n",
+    "🤗 `accelerate` can handle this for you:\n",
+    "\n",
+    "```{.python code-line-numbers=\"2,5\"}\n",
+    "from accelerate import Accelerator\n",
+    "accelerator = Accelerator(gradient_accumulation_steps=4)\n",
+    "...\n",
+    "for batch in dataloader:\n",
+    "    with accelerator.accumulate(model):\n",
+    "        optimizer.zero_grad()\n",
+    "        inputs, targets = batch\n",
+    "        outputs = model(inputs)\n",
+    "        loss = loss_function(outputs, targets)\n",
+    "        accelerator.backward(loss)\n",
+    "        optimizer.step()\n",
+    "        scheduler.step()\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "13f2d1e7-1e50-4a28-b7b4-55e09e15c176",
+   "metadata": {},
+   "source": [
+    "## A Training Library, Gradient Accumulation\n",
+    "\n",
+    "```{.python code-line-numbers=\"5-7,10,11,12,15\"}\n",
+    "ddp_model, dataloader = accelerator.prepare(model, dataloader)\n",
+    "\n",
+    "for index, batch in enumerate(dataloader):\n",
+    "    inputs, targets = batch\n",
+    "    if index != (len(dataloader)-1) and (index + 1) % 4 != 0:\n",
+    "        # Gradients don't sync\n",
+    "        with accelerator.no_sync(ddp_model):\n",
+    "            outputs = ddp_model(inputs)\n",
+    "            loss = loss_func(outputs, targets)\n",
+    "            accelerator.backward(loss)\n",
+    "    else:\n",
+    "        # Gradients finally sync\n",
+    "        outputs = ddp_model(inputs)\n",
+    "        loss = loss_func(outputs, targets)\n",
+    "        accelerator.backward(loss)\n",
+    "```"
+   ]
+  },
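+  {
+   "cell_type": "markdown",
+   "id": "7b2e94d0-1f3a-4c85-b6d7-0a9c4e5f2b68",
+   "metadata": {},
+   "source": [
+    "## A Training Library, Gradient Accumulation {.smaller}\n",
+    "\n",
+    "The decision of *when* to sync can be isolated into a small plain-Python sketch. The helper `should_sync` is hypothetical and follows one common convention (sync on every N-th batch and on the final batch); it is not taken from the 🤗 Accelerate source.\n",
+    "\n",
+    "```python\n",
+    "def should_sync(index, num_batches, accumulation_steps):\n",
+    "    # Sync on every accumulation_steps-th batch, and always on the last batch\n",
+    "    is_last = index == num_batches - 1\n",
+    "    return (index + 1) % accumulation_steps == 0 or is_last\n",
+    "\n",
+    "\n",
+    "# With 10 batches and 4 accumulation steps, gradients sync on these indices\n",
+    "sync_steps = [i for i in range(10) if should_sync(i, 10, 4)]  # [3, 7, 9]\n",
+    "```"
+   ]
+  },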
+  {
+   "cell_type": "markdown",
+   "id": "93575b12-8000-4e8c-81fb-74af415fd76b",
+   "metadata": {},
+   "source": [
+    "# Big Model Inference\n",
+    "\n",
+    "Stable Diffusion taking the world by storm"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b3026c5d-c051-4eac-a4be-af6559294225",
+   "metadata": {},
+   "source": [
+    "## Bigger Models == Higher Compute\n",
+    "\n",
+    "As more large models were released, Hugging Face quickly realized there had to be a way to continue our decentralization of machine learning and let the day-to-day programmer leverage these big models.\n",
+    "\n",
+    "Born out of this effort by Sylvain Gugger: \n",
+    "\n",
+    "🤗 Accelerate: Big Model Inference."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "303925bf-ce22-4e71-a239-69eb419d54d3",
+   "metadata": {},
+   "source": [
+    "## The Basic Premise\n",
+    "\n",
+    "::: {.incremental}\n",
+    "* In PyTorch, there exists the `meta` device. \n",
+    "\n",
+    "* Super small footprint to load in huge models quickly by not loading in their weights immediately.\n",
+    "\n",
+    "* As an input gets passed through each layer, we can load and unload *parts* of the PyTorch model quickly so that only a small portion of the big model is loaded in at a single time.\n",
+    "\n",
+    "* The end result? Stable Diffusion v1 can be run on < 800 MB of vRAM\n",
+    ":::"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c6eef166-c64b-4229-9575-b197c3c03c59",
+   "metadata": {},
+   "source": [
+    "## The Code\n",
+    "\n",
+    "Generally you start with something like so:\n",
+    "\n",
+    "```python\n",
+    "import torch\n",
+    "\n",
+    "my_model = ModelClass(...)\n",
+    "state_dict = torch.load(checkpoint_file)\n",
+    "my_model.load_state_dict(state_dict)\n",
+    "```\n",
+    "\n",
+    "But this has issues:\n",
+    "\n",
+    "1. The full version of the model is loaded at line `3`\n",
+    "2. Another version of the model is loaded into memory at line `4`\n",
+    "\n",
+    "If a 6 *billion* parameter model is being loaded, each copy takes roughly 24GB (at 4 bytes per parameter in fp32), so about 48GB of memory is needed"
+   ]
+  },
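+  {
+   "cell_type": "markdown",
+   "id": "c48f0a6e-52d9-4b17-9e3c-8d1a7f6b0e25",
+   "metadata": {},
+   "source": [
+    "## The Code, The Arithmetic {.smaller}\n",
+    "\n",
+    "The memory estimate for the 6 billion parameter model can be checked with a few lines of Python. This is a back-of-the-envelope sketch: `weights_size_gb` is a hypothetical helper, it assumes 4 bytes per parameter (fp32) unless told otherwise, and it counts 1 GB as 10**9 bytes.\n",
+    "\n",
+    "```python\n",
+    "def weights_size_gb(num_parameters, bytes_per_parameter=4):\n",
+    "    # 4 bytes per parameter assumes fp32; 1 GB is taken as 10**9 bytes\n",
+    "    return num_parameters * bytes_per_parameter / 10**9\n",
+    "\n",
+    "\n",
+    "one_copy = weights_size_gb(6 * 10**9)  # 24.0\n",
+    "peak = 2 * one_copy                    # 48.0: the model plus the state dict\n",
+    "half_precision = weights_size_gb(6 * 10**9, bytes_per_parameter=2)  # 12.0\n",
+    "```"
+   ]
+  },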
+  {
+   "cell_type": "markdown",
+   "id": "53651488-7303-4aa3-83bb-ea7331938a01",
+   "metadata": {},
+   "source": [
+    "## Empty Model Weights\n",
+    "\n",
+    "We can fix step 1 by loading in an empty model skeleton at first:\n",
+    "\n",
+    "```{.python code-line-numbers=\"2,4-5\"}\n",
+    "import torch\n",
+    "from accelerate import init_empty_weights\n",
+    "\n",
+    "with init_empty_weights():\n",
+    "    my_model = ModelClass(...)\n",
+    "state_dict = torch.load(checkpoint_file)\n",
+    "my_model.load_state_dict(state_dict)\n",
+    "```\n",
+    "\n",
+    "::: {.callout-important appearance=\"default\"}\n",
+    "## This code will not run\n",
+    "It is likely that just calling `my_model(x)` will fail as not all tensor operations are supported on the `meta` device.\n",
+    ":::"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "94a2b99a-b154-4cc3-93fd-431ba78ecfdf",
+   "metadata": {},
+   "source": [
+    "## Sharded Checkpoints - The Concept\n",
+    "\n",
+    "The next step is to have \"Sharded Checkpoints\" saved for your model.\n",
+    "\n",
+    "These are smaller chunks of your model weights, stored so that they can be brought in at any particular time.\n",
+    "\n",
+    "This reduces the amount of memory step 2 takes, since we can load in just one \"chunk\" of the model at a time, then swap it out for a new chunk through PyTorch hooks."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "11a55882-8bab-4d6b-b8ca-bfc886351156",
+   "metadata": {},
+   "source": [
+    "## Sharded Checkpoints - The Code\n",
+    "\n",
+    "```{.python code-line-numbers=\"1,6-8\"}\n",
+    "from accelerate import init_empty_weights, load_checkpoint_and_dispatch\n",
+    "\n",
+    "with init_empty_weights():\n",
+    "    my_model = ModelClass(...)\n",
+    "\n",
+    "my_model = load_checkpoint_and_dispatch(\n",
+    "    my_model, \"sharded-weights\", device_map=\"auto\"\n",
+    ")\n",
+    "```\n",
+    "`device_map=\"auto\"` will tell 🤗 Accelerate that it should determine where to put each layer of the model:\n",
+    "\n",
+    "1. Maximum space on the GPU(s)\n",
+    "2. Maximum space on the CPU(s)\n",
+    "3. Utilize disk space through memory-mapped tensors"
+   ]
+  },
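+  {
+   "cell_type": "markdown",
+   "id": "e9051d3b-6a7c-4f28-a0b4-2c8d5e1f7a94",
+   "metadata": {},
+   "source": [
+    "## Sharded Checkpoints - A Placement Sketch {.smaller}\n",
+    "\n",
+    "The GPU, then CPU, then disk priority can be pictured with a toy greedy placement. This is only an illustration under simplifying assumptions: `assign_devices`, the layer names, the sizes, and the device labels are all hypothetical, and the real logic in 🤗 Accelerate is more involved.\n",
+    "\n",
+    "```python\n",
+    "def assign_devices(layer_sizes, capacities):\n",
+    "    # Hypothetical greedy placement: try devices in the order given,\n",
+    "    # and fall back to 'disk' once no device has room left\n",
+    "    remaining = dict(capacities)\n",
+    "    device_map = {}\n",
+    "    for name, size in layer_sizes.items():\n",
+    "        for device in remaining:\n",
+    "            if size <= remaining[device]:\n",
+    "                remaining[device] -= size\n",
+    "                device_map[name] = device\n",
+    "                break\n",
+    "        else:\n",
+    "            device_map[name] = 'disk'\n",
+    "    return device_map\n",
+    "\n",
+    "\n",
+    "# Sizes and capacities are in arbitrary units\n",
+    "layers = {'embed': 2, 'block1': 4, 'block2': 4, 'head': 2}\n",
+    "device_map = assign_devices(layers, {'gpu': 6, 'cpu': 4})\n",
+    "```\n",
+    "\n",
+    "Here `embed` and `block1` fill the GPU, `block2` lands on the CPU, and `head` spills over to disk."
+   ]
+  },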
+  {
+   "cell_type": "markdown",
+   "id": "6796c0ac-77e4-4f88-b01a-25f428b29a87",
+   "metadata": {},
+   "source": [
+    "## Big Model Inference Put Together\n",
+    "\n",
+    "```{.python}\n",
+    "from accelerate import init_empty_weights, load_checkpoint_and_dispatch\n",
+    "\n",
+    "with init_empty_weights():\n",
+    "    my_model = ModelClass(...)\n",
+    "\n",
+    "my_model = load_checkpoint_and_dispatch(\n",
+    "    my_model, \"sharded-weights\", device_map=\"auto\"\n",
+    ")\n",
+    "my_model.eval()\n",
+    "\n",
+    "for batch in dataloader:\n",
+    "    output = my_model(batch)\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6f5122b2-f4fe-4237-aff2-d2a69f85b692",
+   "metadata": {},
+   "source": [
+    "# Thanks for Listening!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "52f29e81-2e55-42d0-8e9d-83e692714909",
+   "metadata": {},
+   "source": [
+    "## Some Handy Resources\n",
+    "\n",
+    "- [🤗 Accelerate documentation](https://hf.co/docs/accelerate)\n",
+    "- [Launching distributed code](https://huggingface.co/docs/accelerate/basic_tutorials/launch)\n",
+    "- [Distributed code and Jupyter Notebooks](https://huggingface.co/docs/accelerate/basic_tutorials/notebook)\n",
+    "- [Migrating to 🤗 Accelerate easily](https://huggingface.co/docs/accelerate/basic_tutorials/migration)\n",
+    "- [Big Model Inference tutorial](https://huggingface.co/docs/accelerate/usage_guides/big_modeling)\n",
+    "- [DeepSpeed and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/deepspeed)\n",
+    "- [Fully Sharded Data Parallelism and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/fsdp)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b9f6a92d-1275-470b-aa27-ff2be450d616",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}