Spaces:
Running
Running
Add VPP support and refactor project.
Browse files- .gitignore +9 -77
- LICENSE +21 -0
- README-dash-visualizer.md +0 -91
- README.md +80 -62
- conf/config.yaml +22 -0
- configs/standard.json +0 -8
- main.py +62 -0
- pipeline.py +0 -491
- pipeline_1f1b.png +0 -3
- pyproject.toml +67 -0
- requirements-dash.txt +0 -5
- src/__init__.py +3 -0
- src/execution_model.py +219 -0
- src/strategies.py +192 -0
- dash_visualizer.py → src/visualizer.py +195 -157
- visualizer.py +0 -141
.gitignore
CHANGED
|
@@ -1,78 +1,10 @@
|
|
| 1 |
# Python
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
eggs/
|
| 12 |
-
.eggs/
|
| 13 |
-
lib/
|
| 14 |
-
lib64/
|
| 15 |
-
parts/
|
| 16 |
-
sdist/
|
| 17 |
-
var/
|
| 18 |
-
wheels/
|
| 19 |
-
*.egg-info/
|
| 20 |
-
.installed.cfg
|
| 21 |
-
*.egg
|
| 22 |
-
|
| 23 |
-
# Virtual Environment
|
| 24 |
-
venv/
|
| 25 |
-
env/
|
| 26 |
-
ENV/
|
| 27 |
-
.env
|
| 28 |
-
|
| 29 |
-
# IDE specific files
|
| 30 |
-
.idea/
|
| 31 |
-
.vscode/
|
| 32 |
-
*.swp
|
| 33 |
-
*.swo
|
| 34 |
-
.DS_Store
|
| 35 |
-
|
| 36 |
-
# Jupyter Notebook
|
| 37 |
-
.ipynb_checkpoints
|
| 38 |
-
|
| 39 |
-
# Distribution / packaging
|
| 40 |
-
.Python
|
| 41 |
-
env/
|
| 42 |
-
build/
|
| 43 |
-
develop-eggs/
|
| 44 |
-
dist/
|
| 45 |
-
downloads/
|
| 46 |
-
eggs/
|
| 47 |
-
.eggs/
|
| 48 |
-
lib/
|
| 49 |
-
lib64/
|
| 50 |
-
parts/
|
| 51 |
-
sdist/
|
| 52 |
-
var/
|
| 53 |
-
wheels/
|
| 54 |
-
*.egg-info/
|
| 55 |
-
.installed.cfg
|
| 56 |
-
*.egg
|
| 57 |
-
|
| 58 |
-
# Unit test / coverage reports
|
| 59 |
-
htmlcov/
|
| 60 |
-
.tox/
|
| 61 |
-
.coverage
|
| 62 |
-
.coverage.*
|
| 63 |
-
.cache
|
| 64 |
-
nosetests.xml
|
| 65 |
-
coverage.xml
|
| 66 |
-
*.cover
|
| 67 |
-
.hypothesis/
|
| 68 |
-
|
| 69 |
-
# Pipeline visualization outputs
|
| 70 |
-
*.png
|
| 71 |
-
*.jpg
|
| 72 |
-
*.jpeg
|
| 73 |
-
*.pdf
|
| 74 |
-
*.svg
|
| 75 |
-
|
| 76 |
-
# Local configuration
|
| 77 |
-
config.ini
|
| 78 |
-
secrets.json
|
|
|
|
| 1 |
# Python
|
| 2 |
+
./venv
|
| 3 |
+
uv.lock
|
| 4 |
+
outputs/
|
| 5 |
+
|
| 6 |
+
# Uncomment below if you want to include these files
|
| 7 |
+
# !assets/*.png
|
| 8 |
+
# !assets/*.jpg
|
| 9 |
+
# !docs/*.png
|
| 10 |
+
# !docs/*.jpg
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
MIT License
|
| 2 |
+
|
| 3 |
+
Copyright (c) 2024
|
| 4 |
+
|
| 5 |
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
| 6 |
+
of this software and associated documentation files (the "Software"), to deal
|
| 7 |
+
in the Software without restriction, including without limitation the rights
|
| 8 |
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
| 9 |
+
copies of the Software, and to permit persons to whom the Software is
|
| 10 |
+
furnished to do so, subject to the following conditions:
|
| 11 |
+
|
| 12 |
+
The above copyright notice and this permission notice shall be included in all
|
| 13 |
+
copies or substantial portions of the Software.
|
| 14 |
+
|
| 15 |
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
| 16 |
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
| 17 |
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
| 18 |
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
| 19 |
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
| 20 |
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
| 21 |
+
SOFTWARE.
|
README-dash-visualizer.md
DELETED
|
@@ -1,91 +0,0 @@
|
|
| 1 |
-
# Pipeline Parallelism Dash Visualizer
|
| 2 |
-
|
| 3 |
-
This is an interactive Dash-based visualizer for pipeline parallelism scheduling, complementing the existing Matplotlib-based visualization.
|
| 4 |
-
|
| 5 |
-
## Features
|
| 6 |
-
|
| 7 |
-
- **Static image generation** similar to the Matplotlib version
|
| 8 |
-
- **Interactive web-based visualization** with Dash
|
| 9 |
-
- **Download functionality** to save the visualization as PNG
|
| 10 |
-
- **Progress indication** during figure creation and image generation
|
| 11 |
-
- **Compatible API** with the existing visualizer
|
| 12 |
-
|
| 13 |
-
## Installation
|
| 14 |
-
|
| 15 |
-
Install the required dependencies:
|
| 16 |
-
|
| 17 |
-
```bash
|
| 18 |
-
pip install -r requirements-dash.txt
|
| 19 |
-
```
|
| 20 |
-
|
| 21 |
-
## Usage
|
| 22 |
-
|
| 23 |
-
### From Python
|
| 24 |
-
|
| 25 |
-
```python
|
| 26 |
-
from pipeline import create_1f1b_schedule
|
| 27 |
-
from dash_visualizer import visualize_pipeline_parallelism_dash, save_pipeline_visualization_plotly
|
| 28 |
-
|
| 29 |
-
# Create a schedule
|
| 30 |
-
schedule = create_1f1b_schedule(
|
| 31 |
-
num_stages=4,
|
| 32 |
-
num_batches=8,
|
| 33 |
-
forward_times=[1.0, 1.0, 1.0, 1.0],
|
| 34 |
-
backward_times=[2.0, 2.0, 2.0, 2.0],
|
| 35 |
-
)
|
| 36 |
-
|
| 37 |
-
# Generate a static image
|
| 38 |
-
save_pipeline_visualization_plotly(
|
| 39 |
-
schedule=schedule,
|
| 40 |
-
schedule_type="1f1b",
|
| 41 |
-
output_file="pipeline_plotly.png"
|
| 42 |
-
)
|
| 43 |
-
|
| 44 |
-
# OR launch an interactive Dash app
|
| 45 |
-
visualize_pipeline_parallelism_dash(
|
| 46 |
-
schedule=schedule,
|
| 47 |
-
schedule_type="1f1b",
|
| 48 |
-
port=8050,
|
| 49 |
-
debug=False
|
| 50 |
-
)
|
| 51 |
-
```
|
| 52 |
-
|
| 53 |
-
### Using the Command Line
|
| 54 |
-
|
| 55 |
-
You can use the updated command line interface:
|
| 56 |
-
|
| 57 |
-
```bash
|
| 58 |
-
# Generate a static image with Dash/Plotly
|
| 59 |
-
python pipeline.py --visualizer dash --output-file pipeline_viz.png
|
| 60 |
-
|
| 61 |
-
# Launch an interactive Dash app
|
| 62 |
-
python pipeline.py --visualizer dash-interactive
|
| 63 |
-
|
| 64 |
-
# Use the original Matplotlib visualizer
|
| 65 |
-
python pipeline.py --visualizer matplotlib
|
| 66 |
-
```
|
| 67 |
-
|
| 68 |
-
You can also use the dash_visualizer.py script directly for testing:
|
| 69 |
-
|
| 70 |
-
```bash
|
| 71 |
-
# Generate a static image
|
| 72 |
-
python dash_visualizer.py --output test_viz.png
|
| 73 |
-
|
| 74 |
-
# Launch an interactive app
|
| 75 |
-
python dash_visualizer.py --interactive
|
| 76 |
-
```
|
| 77 |
-
|
| 78 |
-
## Differences from Matplotlib Visualizer
|
| 79 |
-
|
| 80 |
-
The Dash-based visualizer provides all the same visual elements as the Matplotlib version:
|
| 81 |
-
- Color-coded rectangles for forward, backward, and optimizer operations
|
| 82 |
-
- Batch numbers displayed inside each rectangle
|
| 83 |
-
- Device labels on the y-axis
|
| 84 |
-
- Clear legend
|
| 85 |
-
|
| 86 |
-
Additional features:
|
| 87 |
-
- Interactive web interface
|
| 88 |
-
- Hovering over elements to see details
|
| 89 |
-
- Download button to save the visualization
|
| 90 |
-
- Progress bars for tracking visualization creation
|
| 91 |
-
- Responsive layout that works well on different screen sizes
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
README.md
CHANGED
|
@@ -1,77 +1,95 @@
|
|
| 1 |
-
# Pipeline Parallelism
|
| 2 |
|
| 3 |
-
This
|
| 4 |
|
| 5 |
-
##
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
|
| 7 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
|
|
|
|
| 9 |
```bash
|
| 10 |
-
python
|
| 11 |
```
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
| Option | Short | Description |
|
| 17 |
-
|--------|-------|-------------|
|
| 18 |
-
| `--config` | `-c` | Path to config file (JSON or YAML) |
|
| 19 |
-
| `--num-stages` | `-s` | Number of pipeline stages (devices) |
|
| 20 |
-
| `--num-batches` | `-b` | Number of micro-batches |
|
| 21 |
-
| `--forward-times` | `-f` | Time for forward pass at each stage (space-separated list) |
|
| 22 |
-
| `--backward-times` | `-bw` | Time for backward pass at each stage (space-separated list) |
|
| 23 |
-
| `--output` | `-o` | Output file path for visualization |
|
| 24 |
-
| `--no-visualization` | | Skip visualization generation |
|
| 25 |
-
| `--p2p-time`| | P2P communication time of PP |
|
| 26 |
-
|
| 27 |
-
### Using Configuration Files
|
| 28 |
-
|
| 29 |
-
You can use either JSON or YAML configuration files:
|
| 30 |
-
|
| 31 |
-
Example JSON configuration (sample_config.json):
|
| 32 |
-
```json
|
| 33 |
-
{
|
| 34 |
-
"num_stages": 6,
|
| 35 |
-
"num_batches": 12,
|
| 36 |
-
"forward_times": [0.8, 1.0, 1.2, 1.0, 0.9, 1.1],
|
| 37 |
-
"backward_times": [1.6, 2.0, 2.4, 2.0, 1.8, 2.2],
|
| 38 |
-
"output_file": "pipeline_1f1b_custom.png"
|
| 39 |
-
}
|
| 40 |
```
|
| 41 |
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 60 |
```
|
| 61 |
|
| 62 |
-
##
|
| 63 |
|
| 64 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 65 |
|
| 66 |
-
|
| 67 |
-
1. **Warmup Phase**: Forward passes for the first several micro-batches
|
| 68 |
-
2. **Steady State**: Each device alternates between forward and backward passes
|
| 69 |
-
3. **Cooldown Phase**: Backward passes to complete the computation for remaining micro-batches
|
| 70 |
|
| 71 |
-
|
| 72 |
|
| 73 |
-
##
|
| 74 |
|
| 75 |
-
|
| 76 |
-
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (NeurIPS'19)
|
| 77 |
-
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
|
|
|
|
| 1 |
+
# Pipeline Parallelism Emulation
|
| 2 |
|
| 3 |
+
This project provides tools for emulating and visualizing pipeline parallelism strategies used in large language model training.
|
| 4 |
|
| 5 |
+
## Overview
|
| 6 |
+
|
| 7 |
+
Pipeline parallelism is a technique used to train large models by partitioning the model across multiple devices and processing data in a pipelined fashion. This project allows you to:
|
| 8 |
+
|
| 9 |
+
- Simulate different pipeline parallelism strategies (1F1B, Interleaved)
|
| 10 |
+
- Visualize the execution schedule on multiple devices
|
| 11 |
+
- Compare different strategies for efficiency
|
| 12 |
+
|
| 13 |
+
## Features
|
| 14 |
+
- Supported Pipeline Strategies:
|
| 15 |
+
- 1F1B
|
| 16 |
+
- Interleaved 1F1B
|
| 17 |
+
- Visualization:
|
| 18 |
+
- Interactive visualization dashboard using Plotly/Dash
|
| 19 |
+
- Config:
|
| 20 |
+
- Configurable simulation parameters through Hydra
|
| 21 |
+
- Each stage
|
| 22 |
+
|
| 23 |
+
## Installation
|
| 24 |
+
|
| 25 |
+
This project uses [uv](https://github.com/astral-sh/uv) for dependency management.
|
| 26 |
|
| 27 |
+
Setup `uv` if not installed in your computer:
|
| 28 |
+
```
|
| 29 |
+
# On macOS and Linux.
|
| 30 |
+
curl -LsSf https://astral.sh/uv/install.sh | sh
|
| 31 |
+
```
|
| 32 |
+
|
| 33 |
+
## Usage
|
| 34 |
|
| 35 |
+
Running for 1F1B strategy:
|
| 36 |
```bash
|
| 37 |
+
uv run python main.py strategy=1f1b num_devices=4 num_stages=4 num_batches=8
|
| 38 |
```
|
| 39 |
+
|
| 40 |
+
```bash
|
| 41 |
+
uv run python main.py strategy=interleave num_devices=4 num_stages=8 num_batches=8
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 42 |
```
|
| 43 |
|
| 44 |
+
## Configuration
|
| 45 |
+
|
| 46 |
+
The default configuration is in `conf/config.yaml`. You can override any parameter on the command line or create configuration groups for different scenarios.
|
| 47 |
+
|
| 48 |
+
### Using Different Configuration Files
|
| 49 |
+
|
| 50 |
+
You can use different configuration files with Hydra in several ways:
|
| 51 |
+
|
| 52 |
+
#### Recommended Approach
|
| 53 |
+
|
| 54 |
+
1. Create multiple configuration files in the `conf` directory for different use cases:
|
| 55 |
+
```
|
| 56 |
+
conf/
|
| 57 |
+
├── config.yaml # Default configuration
|
| 58 |
+
└── model_A.yaml # Create your own config with stage-specific latency for performance projection.
|
| 59 |
+
```
|
| 60 |
+
|
| 61 |
+
2. Run with your desired configuration using the `--config-name` flag:
|
| 62 |
+
```bash
|
| 63 |
+
uv run python main.py --config-name=model_A
|
| 64 |
+
```
|
| 65 |
+
|
| 66 |
+
#### Override Specific Parameters
|
| 67 |
+
|
| 68 |
+
You can also override specific parameters at runtime:
|
| 69 |
+
```bash
|
| 70 |
+
uv run python main.py op_times.forward=0.5 op_times.backward=1.0 num_batches=6
|
| 71 |
```
|
| 72 |
|
| 73 |
+
## Project Structure
|
| 74 |
|
| 75 |
+
```
|
| 76 |
+
PP-Emulation/
|
| 77 |
+
├── conf/ # Hydra configuration files
|
| 78 |
+
│ └── config.yaml # Default configuration
|
| 79 |
+
├── src/ # Source code
|
| 80 |
+
│ ├── __init__.py # Package initialization
|
| 81 |
+
│ ├── execution_model.py # Schedule execution models
|
| 82 |
+
│ ├── strategies.py # Pipeline parallelism strategies
|
| 83 |
+
│ └── visualizer.py # Visualization utilities
|
| 84 |
+
├── main.py # Main entry point
|
| 85 |
+
├── pyproject.toml # Project metadata and dependencies
|
| 86 |
+
└── README.md # This file
|
| 87 |
+
```
|
| 88 |
|
| 89 |
+
## License
|
|
|
|
|
|
|
|
|
|
| 90 |
|
| 91 |
+
This project is licensed under the MIT License - see the LICENSE file for details.
|
| 92 |
|
| 93 |
+
## Contributing
|
| 94 |
|
| 95 |
+
Contributions are welcome! Please feel free to submit a Pull Request.
|
|
|
|
|
|
conf/config.yaml
ADDED
|
@@ -0,0 +1,22 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Default configuration for Pipeline Parallelism Emulation
|
| 2 |
+
num_devices: 4
|
| 3 |
+
num_stages: 4
|
| 4 |
+
num_batches: 12
|
| 5 |
+
visualization_port: 8050
|
| 6 |
+
strategy: "1f1b" # Options: "1f1b", "interleave"
|
| 7 |
+
p2p_latency: 0.0
|
| 8 |
+
|
| 9 |
+
# Operation time configurations
|
| 10 |
+
op_times:
|
| 11 |
+
# Option 1: Simple configuration (same time for all stages)
|
| 12 |
+
forward: 1.0
|
| 13 |
+
backward: 2.0
|
| 14 |
+
|
| 15 |
+
# Option 2: Commented example of stage-specific configuration
|
| 16 |
+
# forward:
|
| 17 |
+
# 0: 0.8 # Stage 0 forward time
|
| 18 |
+
# 1: 1.2 # Stage 1 forward time
|
| 19 |
+
# 2: 1.5 # Stage 2 forward time
|
| 20 |
+
# 3: 0.9 # Stage 3 forward time
|
| 21 |
+
# backward:
|
| 22 |
+
# 0: 1.0 # Stage 0 backward time
|
configs/standard.json
DELETED
|
@@ -1,8 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"num_stages": 4,
|
| 3 |
-
"num_batches": 8,
|
| 4 |
-
"forward_times": [1.0, 1.0, 1.0, 1.0],
|
| 5 |
-
"backward_times": [2.0, 2.0, 2.0, 2.0],
|
| 6 |
-
"output_file": "pipeline_1f1b.png",
|
| 7 |
-
"p2p_time": 0.0
|
| 8 |
-
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
main.py
ADDED
|
@@ -0,0 +1,62 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from src.execution_model import ScheduleConfig, ScheduleExecutor
|
| 2 |
+
from src.strategies import generate_1f1b_interleave_schedule, generate_1f1b_schedule
|
| 3 |
+
from src.visualizer import visualize_pipeline_parallelism_dash, save_pipeline_visualization_plotly
|
| 4 |
+
import hydra
|
| 5 |
+
from omegaconf import DictConfig, OmegaConf
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
@hydra.main(config_path="conf", config_name="config", version_base=None)
|
| 9 |
+
def main(cfg: DictConfig) -> None:
|
| 10 |
+
"""Run pipeline parallelism simulation with the specified configuration."""
|
| 11 |
+
print(f"Running with configuration: {cfg}")
|
| 12 |
+
|
| 13 |
+
if cfg.strategy == "1f1b":
|
| 14 |
+
run_1f1b(cfg)
|
| 15 |
+
elif cfg.strategy == "interleave":
|
| 16 |
+
run_interleave(cfg)
|
| 17 |
+
else:
|
| 18 |
+
raise ValueError(f"Unknown strategy: {cfg.strategy}")
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
def run_1f1b(cfg: DictConfig) -> None:
|
| 22 |
+
"""Run 1F1B pipeline parallelism simulation."""
|
| 23 |
+
# Convert OmegaConf to dict for op_times if it exists
|
| 24 |
+
op_times = OmegaConf.to_container(cfg.op_times) if hasattr(cfg, 'op_times') else None
|
| 25 |
+
|
| 26 |
+
schedule_config = ScheduleConfig(
|
| 27 |
+
num_devices=cfg.num_devices,
|
| 28 |
+
num_stages=cfg.num_stages,
|
| 29 |
+
num_batches=cfg.num_batches,
|
| 30 |
+
p2p_latency=cfg.p2p_latency,
|
| 31 |
+
op_times=op_times,
|
| 32 |
+
placement_strategy="1f1b"
|
| 33 |
+
)
|
| 34 |
+
schedule = generate_1f1b_schedule(schedule_config)
|
| 35 |
+
executor = ScheduleExecutor(schedule)
|
| 36 |
+
executor.execute()
|
| 37 |
+
|
| 38 |
+
visualize_pipeline_parallelism_dash(schedule, port=cfg.visualization_port)
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
def run_interleave(cfg: DictConfig) -> None:
|
| 42 |
+
"""Run interleaved pipeline parallelism simulation."""
|
| 43 |
+
# Convert OmegaConf to dict for op_times if it exists
|
| 44 |
+
op_times = OmegaConf.to_container(cfg.op_times) if hasattr(cfg, 'op_times') else None
|
| 45 |
+
|
| 46 |
+
schedule_config = ScheduleConfig(
|
| 47 |
+
num_devices=cfg.num_devices,
|
| 48 |
+
num_stages=cfg.num_stages,
|
| 49 |
+
num_batches=cfg.num_batches,
|
| 50 |
+
p2p_latency=cfg.p2p_latency,
|
| 51 |
+
placement_strategy="interleave",
|
| 52 |
+
op_times=op_times
|
| 53 |
+
)
|
| 54 |
+
schedule = generate_1f1b_interleave_schedule(schedule_config)
|
| 55 |
+
executor = ScheduleExecutor(schedule)
|
| 56 |
+
executor.execute()
|
| 57 |
+
|
| 58 |
+
visualize_pipeline_parallelism_dash(schedule, port=cfg.visualization_port)
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
if __name__ == "__main__":
|
| 62 |
+
main()
|
pipeline.py
DELETED
|
@@ -1,491 +0,0 @@
|
|
| 1 |
-
import argparse
|
| 2 |
-
import json
|
| 3 |
-
import yaml
|
| 4 |
-
import os
|
| 5 |
-
from typing import List, Dict
|
| 6 |
-
|
| 7 |
-
# Import visualization function from the new module
|
| 8 |
-
from visualizer import visualize_pipeline_parallelism
|
| 9 |
-
try:
|
| 10 |
-
from dash_visualizer import visualize_pipeline_parallelism_dash, save_pipeline_visualization_plotly
|
| 11 |
-
DASH_AVAILABLE = True
|
| 12 |
-
except ImportError:
|
| 13 |
-
DASH_AVAILABLE = False
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
def create_1f1b_schedule(
|
| 17 |
-
num_stages: int,
|
| 18 |
-
num_batches: int,
|
| 19 |
-
forward_times: List[float],
|
| 20 |
-
backward_times: List[float],
|
| 21 |
-
p2p_time: float = 0.0,
|
| 22 |
-
) -> Dict[int, List[Dict]]:
|
| 23 |
-
"""
|
| 24 |
-
Create a 1F1B (One-Forward-One-Backward) schedule for pipeline parallelism.
|
| 25 |
-
|
| 26 |
-
This implementation takes a data-centric approach:
|
| 27 |
-
1. First determine the operation sequence for each pipeline stage (which microbatch to process when)
|
| 28 |
-
2. Then calculate timing based on dependencies between operations
|
| 29 |
-
|
| 30 |
-
The 1F1B pattern has three phases:
|
| 31 |
-
- Warmup: Forward passes for first num_stages microbatches
|
| 32 |
-
- Steady state: Alternating between forward and backward passes
|
| 33 |
-
- Cooldown: Backward passes for remaining microbatches
|
| 34 |
-
|
| 35 |
-
Returns:
|
| 36 |
-
A dictionary mapping device IDs to lists of tasks.
|
| 37 |
-
Each task is a dictionary with keys:
|
| 38 |
-
- 'type': 'forward' or 'backward'
|
| 39 |
-
- 'batch': batch number
|
| 40 |
-
- 'start_time': start time of the task
|
| 41 |
-
- 'duration': duration of the task
|
| 42 |
-
"""
|
| 43 |
-
# Initialize empty schedule
|
| 44 |
-
schedule = {stage: [] for stage in range(num_stages)}
|
| 45 |
-
|
| 46 |
-
# Step 1: Determine operation sequence for each stage
|
| 47 |
-
# This will generate the sequence of operations (forward/backward on which microbatch)
|
| 48 |
-
# that each stage should perform, without timing information yet
|
| 49 |
-
operation_sequence = determine_1f1b_operation_sequence(num_stages, num_batches)
|
| 50 |
-
|
| 51 |
-
# Step 2: Convert operation sequence to schedule with timing
|
| 52 |
-
# Taking into account dependencies between operations
|
| 53 |
-
schedule = calculate_operation_timing(
|
| 54 |
-
operation_sequence, num_stages, forward_times, backward_times, p2p_time
|
| 55 |
-
)
|
| 56 |
-
|
| 57 |
-
return schedule
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
def determine_1f1b_operation_sequence(
|
| 61 |
-
num_stages: int, num_batches: int
|
| 62 |
-
) -> Dict[int, List[Dict]]:
|
| 63 |
-
"""
|
| 64 |
-
Determine the sequence of operations (forward/backward) for each stage in 1F1B scheduling.
|
| 65 |
-
|
| 66 |
-
Args:
|
| 67 |
-
num_stages: Number of pipeline stages
|
| 68 |
-
num_batches: Number of micro-batches
|
| 69 |
-
|
| 70 |
-
Returns:
|
| 71 |
-
Dictionary mapping stage ID to a list of operations in sequence.
|
| 72 |
-
Each operation is a dict with keys 'type' ('forward' or 'backward') and 'batch'.
|
| 73 |
-
"""
|
| 74 |
-
operation_sequence = {i: [] for i in range(num_stages)}
|
| 75 |
-
for current_stage in range(num_stages):
|
| 76 |
-
warmup_batches = num_stages - current_stage
|
| 77 |
-
for j in range(1, warmup_batches + 1):
|
| 78 |
-
operation_sequence[current_stage].append({"type": "forward", "batch": j})
|
| 79 |
-
steady_batches = num_batches - warmup_batches
|
| 80 |
-
for j in range(warmup_batches + 1, warmup_batches + steady_batches + 1):
|
| 81 |
-
operation_sequence[current_stage].append(
|
| 82 |
-
{"type": "backward", "batch": j - warmup_batches}
|
| 83 |
-
)
|
| 84 |
-
operation_sequence[current_stage].append({"type": "forward", "batch": j})
|
| 85 |
-
for j in range(warmup_batches):
|
| 86 |
-
operation_sequence[current_stage].append(
|
| 87 |
-
{"type": "backward", "batch": j + steady_batches + 1}
|
| 88 |
-
)
|
| 89 |
-
|
| 90 |
-
return operation_sequence
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
def calculate_operation_timing(
|
| 94 |
-
operation_sequence: Dict[int, List[Dict]],
|
| 95 |
-
num_stages: int,
|
| 96 |
-
forward_times: List[float],
|
| 97 |
-
backward_times: List[float],
|
| 98 |
-
p2p_time: float = 0.0,
|
| 99 |
-
) -> Dict[int, List[Dict]]:
|
| 100 |
-
"""
|
| 101 |
-
Recursively calculate the specific timing of each operation in a 1F1B schedule.
|
| 102 |
-
|
| 103 |
-
When encountering an operation that depends on a previous operation that hasn't been calculated yet,
|
| 104 |
-
it will recursively calculate the timing of those operations.
|
| 105 |
-
|
| 106 |
-
Args:
|
| 107 |
-
operation_sequence: Operation sequence for each stage
|
| 108 |
-
num_stages: Number of pipeline stages
|
| 109 |
-
forward_times: Forward propagation time for each stage
|
| 110 |
-
backward_times: Backward propagation time for each stage
|
| 111 |
-
p2p_time: Point-to-point communication time between stages
|
| 112 |
-
|
| 113 |
-
Returns:
|
| 114 |
-
Complete schedule with timing information, each operation includes start_time and duration
|
| 115 |
-
"""
|
| 116 |
-
# Initialize schedule with timing information
|
| 117 |
-
schedule = {i: [] for i in range(num_stages)}
|
| 118 |
-
|
| 119 |
-
# For recording already computed operation end times
|
| 120 |
-
# Format: {(stage, batch, op_type): (start_time, end_time)}
|
| 121 |
-
computed_ops = {}
|
| 122 |
-
|
| 123 |
-
# For recording the end time of the last operation for each stage
|
| 124 |
-
stage_last_end_time = [0.0] * num_stages
|
| 125 |
-
|
| 126 |
-
# Helper function: recursively calculate the time for an operation
|
| 127 |
-
def compute_op_time(stage, batch, op_type):
|
| 128 |
-
# Check if this operation has already been calculated
|
| 129 |
-
key = (stage, batch, op_type)
|
| 130 |
-
if key in computed_ops:
|
| 131 |
-
return computed_ops[key]
|
| 132 |
-
|
| 133 |
-
# Get operation duration
|
| 134 |
-
duration = (
|
| 135 |
-
forward_times[stage] if op_type == "forward" else backward_times[stage]
|
| 136 |
-
)
|
| 137 |
-
|
| 138 |
-
# Determine start time (dependent on other operations)
|
| 139 |
-
# 1. Consider sequential dependencies on the stage (must wait for previous operation to complete)
|
| 140 |
-
start_time = stage_last_end_time[stage]
|
| 141 |
-
|
| 142 |
-
# 2. Forward pass also depends on forward pass of previous stage (if not the first stage)
|
| 143 |
-
if op_type == "forward" and stage > 0:
|
| 144 |
-
# Recursively calculate the time for the forward pass of the previous stage (if not calculated yet)
|
| 145 |
-
prev_stage_key = (stage - 1, batch, "forward")
|
| 146 |
-
if prev_stage_key not in computed_ops:
|
| 147 |
-
prev_start, prev_end = compute_op_time(stage - 1, batch, "forward")
|
| 148 |
-
else:
|
| 149 |
-
_, prev_end = computed_ops[prev_stage_key]
|
| 150 |
-
# Update start time
|
| 151 |
-
start_time = max(start_time, prev_end + p2p_time)
|
| 152 |
-
|
| 153 |
-
# 3. Backward pass depends on:
|
| 154 |
-
elif op_type == "backward":
|
| 155 |
-
# a. Forward pass of the same stage
|
| 156 |
-
same_stage_forward_key = (stage, batch, "forward")
|
| 157 |
-
if same_stage_forward_key not in computed_ops:
|
| 158 |
-
_, forward_end = compute_op_time(stage, batch, "forward")
|
| 159 |
-
else:
|
| 160 |
-
_, forward_end = computed_ops[same_stage_forward_key]
|
| 161 |
-
|
| 162 |
-
start_time = max(start_time, forward_end)
|
| 163 |
-
|
| 164 |
-
# b. Backward pass of the next stage (if not the last stage)
|
| 165 |
-
if stage < num_stages - 1:
|
| 166 |
-
next_stage_backward_key = (stage + 1, batch, "backward")
|
| 167 |
-
if next_stage_backward_key not in computed_ops:
|
| 168 |
-
_, next_backward_end = compute_op_time(stage + 1, batch, "backward")
|
| 169 |
-
else:
|
| 170 |
-
_, next_backward_end = computed_ops[next_stage_backward_key]
|
| 171 |
-
|
| 172 |
-
start_time = max(start_time, next_backward_end + p2p_time)
|
| 173 |
-
|
| 174 |
-
# Calculate end time
|
| 175 |
-
end_time = start_time + duration
|
| 176 |
-
|
| 177 |
-
# Store calculation results
|
| 178 |
-
computed_ops[key] = (start_time, end_time)
|
| 179 |
-
|
| 180 |
-
# Update the end time of the last operation for this stage
|
| 181 |
-
stage_last_end_time[stage] = end_time
|
| 182 |
-
|
| 183 |
-
return start_time, end_time
|
| 184 |
-
|
| 185 |
-
# Calculate time for each operation in the operation_sequence
|
| 186 |
-
for i in range(len(operation_sequence[0])):
|
| 187 |
-
for stage in range(num_stages):
|
| 188 |
-
batch = operation_sequence[stage][i]["batch"]
|
| 189 |
-
op_type = operation_sequence[stage][i]["type"]
|
| 190 |
-
|
| 191 |
-
# Recursively calculate the time for this operation
|
| 192 |
-
start_time, _ = compute_op_time(stage, batch, op_type)
|
| 193 |
-
|
| 194 |
-
# Fill in scheduling information
|
| 195 |
-
op_with_timing = operation_sequence[stage][i].copy()
|
| 196 |
-
op_with_timing["start_time"] = start_time
|
| 197 |
-
op_with_timing["duration"] = (
|
| 198 |
-
forward_times[stage] if op_type == "forward" else backward_times[stage]
|
| 199 |
-
)
|
| 200 |
-
schedule[stage].append(op_with_timing)
|
| 201 |
-
|
| 202 |
-
return schedule
|
| 203 |
-
|
| 204 |
-
|
| 205 |
-
def get_schedule_info(schedule: Dict[int, List[Dict]]):
|
| 206 |
-
num_stages = len(schedule)
|
| 207 |
-
|
| 208 |
-
max_time = 0
|
| 209 |
-
for device in schedule:
|
| 210 |
-
for task in schedule[device]:
|
| 211 |
-
end_time = task["start_time"] + task["duration"]
|
| 212 |
-
if end_time > max_time:
|
| 213 |
-
max_time = end_time
|
| 214 |
-
|
| 215 |
-
total_execution_time = max_time * num_stages
|
| 216 |
-
|
| 217 |
-
total_computation_time = 0
|
| 218 |
-
device_computation_times = {}
|
| 219 |
-
|
| 220 |
-
for device in schedule:
|
| 221 |
-
device_computation_time = 0
|
| 222 |
-
for task in schedule[device]:
|
| 223 |
-
device_computation_time += task["duration"]
|
| 224 |
-
device_computation_times[device] = device_computation_time
|
| 225 |
-
total_computation_time += device_computation_time
|
| 226 |
-
|
| 227 |
-
bubble_rate = (
|
| 228 |
-
total_execution_time - total_computation_time
|
| 229 |
-
) / total_computation_time
|
| 230 |
-
|
| 231 |
-
return {
|
| 232 |
-
"bubble_rate": f"{bubble_rate*100:.2f}%",
|
| 233 |
-
"execution_time": f"{max_time / 1000:.2f} s",
|
| 234 |
-
}
|
| 235 |
-
|
| 236 |
-
|
| 237 |
-
def read_config_file(config_path):
|
| 238 |
-
"""
|
| 239 |
-
Read configuration from a JSON or YAML file.
|
| 240 |
-
|
| 241 |
-
Args:
|
| 242 |
-
config_path: Path to the config file (JSON or YAML)
|
| 243 |
-
|
| 244 |
-
Returns:
|
| 245 |
-
Dictionary containing configuration parameters
|
| 246 |
-
"""
|
| 247 |
-
if not os.path.exists(config_path):
|
| 248 |
-
raise FileNotFoundError(f"Config file not found: {config_path}")
|
| 249 |
-
|
| 250 |
-
file_ext = os.path.splitext(config_path)[1].lower()
|
| 251 |
-
|
| 252 |
-
try:
|
| 253 |
-
with open(config_path, "r") as f:
|
| 254 |
-
if file_ext == ".json":
|
| 255 |
-
config = json.load(f)
|
| 256 |
-
elif file_ext in (".yaml", ".yml"):
|
| 257 |
-
config = yaml.safe_load(f)
|
| 258 |
-
else:
|
| 259 |
-
raise ValueError(
|
| 260 |
-
f"Unsupported config file format: {file_ext}. Use .json, .yaml, or .yml"
|
| 261 |
-
)
|
| 262 |
-
return config
|
| 263 |
-
except Exception as e:
|
| 264 |
-
raise ValueError(f"Error reading config file: {str(e)}")
|
| 265 |
-
|
| 266 |
-
|
| 267 |
-
def parse_args():
|
| 268 |
-
"""
|
| 269 |
-
Parse command-line arguments for the pipeline parallelism tool.
|
| 270 |
-
|
| 271 |
-
Returns:
|
| 272 |
-
Parsed arguments namespace
|
| 273 |
-
"""
|
| 274 |
-
parser = argparse.ArgumentParser(
|
| 275 |
-
description="Pipeline Parallelism Scheduler and Visualizer",
|
| 276 |
-
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
|
| 277 |
-
)
|
| 278 |
-
|
| 279 |
-
# Config file option
|
| 280 |
-
parser.add_argument(
|
| 281 |
-
"--config", "-c", type=str, help="Path to config file (JSON or YAML)"
|
| 282 |
-
)
|
| 283 |
-
|
| 284 |
-
# Main parameters
|
| 285 |
-
parser.add_argument(
|
| 286 |
-
"--num-stages",
|
| 287 |
-
"-s",
|
| 288 |
-
type=int,
|
| 289 |
-
default=0,
|
| 290 |
-
help="Number of pipeline stages (devices)",
|
| 291 |
-
)
|
| 292 |
-
|
| 293 |
-
parser.add_argument(
|
| 294 |
-
"--num-batches", "-b", type=int, default=0, help="Number of micro-batches"
|
| 295 |
-
)
|
| 296 |
-
|
| 297 |
-
# Forward and backward times
|
| 298 |
-
parser.add_argument(
|
| 299 |
-
"--forward-times",
|
| 300 |
-
"-f",
|
| 301 |
-
type=float,
|
| 302 |
-
nargs="+",
|
| 303 |
-
help="Time for forward pass at each stage (space-separated list)",
|
| 304 |
-
)
|
| 305 |
-
|
| 306 |
-
parser.add_argument(
|
| 307 |
-
"--backward-times",
|
| 308 |
-
"-bw",
|
| 309 |
-
type=float,
|
| 310 |
-
nargs="+",
|
| 311 |
-
help="Time for backward pass at each stage (space-separated list)",
|
| 312 |
-
)
|
| 313 |
-
|
| 314 |
-
# Output options
|
| 315 |
-
parser.add_argument(
|
| 316 |
-
"--output",
|
| 317 |
-
"-o",
|
| 318 |
-
type=str,
|
| 319 |
-
default="pipeline_1f1b.png",
|
| 320 |
-
help="Output file path for visualization",
|
| 321 |
-
)
|
| 322 |
-
|
| 323 |
-
parser.add_argument(
|
| 324 |
-
"--no-visualization", action="store_true", help="Skip visualization generation"
|
| 325 |
-
)
|
| 326 |
-
|
| 327 |
-
parser.add_argument(
|
| 328 |
-
"--p2p-time",
|
| 329 |
-
type=float,
|
| 330 |
-
default=0.0,
|
| 331 |
-
help="Time for point-to-point communication between stages",
|
| 332 |
-
)
|
| 333 |
-
|
| 334 |
-
parser.add_argument("--visualizer", choices=["matplotlib", "dash", "dash-interactive"],
|
| 335 |
-
default="matplotlib", help="Visualization library to use")
|
| 336 |
-
|
| 337 |
-
return parser.parse_args()
|
| 338 |
-
|
| 339 |
-
|
| 340 |
-
def example_usage():
    """Demo: build a 1F1B schedule with fixed parameters, plot it, and print its stats."""
    # Fixed demo configuration: 4 pipeline stages, 10 micro-batches.
    num_stages = 4
    num_batches = 10

    # Uniform per-stage costs: forward 1.0, backward 2.0.
    forward_times = [1.0] * num_stages
    backward_times = [2.0] * num_stages

    # Build the 1F1B schedule for these parameters.
    schedule = create_1f1b_schedule(
        num_stages=num_stages,
        num_batches=num_batches,
        forward_times=forward_times,
        backward_times=backward_times,
    )

    # Render the schedule to the default output file.
    visualize_pipeline_parallelism(
        schedule=schedule, schedule_type="1f1b", output_file="pipeline_1f1b.png"
    )

    # Print summary statistics for the schedule.
    print(get_schedule_info(schedule))
| 366 |
-
|
| 367 |
-
|
| 368 |
-
def main():
    """
    Parse arguments (and optionally a config file), build a 1F1B pipeline
    schedule, optionally visualize it, and print summary statistics.

    Returns:
        dict with keys "schedule", "schedule_info", "num_stages", "num_batches".
    """
    args = parse_args()

    # Command-line arguments provide the initial values.
    num_stages = args.num_stages
    num_batches = args.num_batches
    forward_times = args.forward_times
    backward_times = args.backward_times
    output_file = args.output
    p2p_time = args.p2p_time

    # A config file, when given, overrides the command line for any key it
    # defines (NOTE: this is config-over-CLI precedence, matching the code).
    if args.config:
        try:
            print(f"Reading configuration from {args.config}")
            config = read_config_file(args.config)

            num_stages = config.get("num_stages", num_stages)
            num_batches = config.get("num_batches", num_batches)
            # Fall back to the CLI values when the config omits these keys;
            # previously the CLI lists were silently discarded here.
            forward_times = config.get("forward_times", forward_times)
            backward_times = config.get("backward_times", backward_times)
            output_file = config.get("output_file", output_file)
            # Preserve the CLI p2p value instead of resetting to 0.0.
            p2p_time = config.get("p2p_time", p2p_time)

        except Exception as e:
            print(f"Error reading config file: {str(e)}")
            print("Falling back to command line arguments or defaults")

    # Normalize the per-stage time lists to exactly num_stages entries.
    forward_times = _fit_times(forward_times, num_stages, 1.0, "forward_times")
    backward_times = _fit_times(backward_times, num_stages, 2.0, "backward_times")

    print(f"Running with parameters:")
    print(f"  num_stages: {num_stages}")
    print(f"  num_batches: {num_batches}")
    print(f"  forward_times: {forward_times}")
    print(f"  backward_times: {backward_times}")
    print(f"  output_file: {output_file}")

    # Create 1F1B schedule
    schedule = create_1f1b_schedule(
        num_stages=num_stages,
        num_batches=num_batches,
        forward_times=forward_times,
        backward_times=backward_times,
        p2p_time=p2p_time,
    )

    # Create visualization unless --no-visualization is specified
    if not args.no_visualization:
        _render_schedule(schedule, args.visualizer, output_file)

    # Analyze the schedule
    schedule_info = get_schedule_info(schedule)
    print(schedule_info)

    return {
        "schedule": schedule,
        "schedule_info": schedule_info,
        "num_stages": num_stages,
        "num_batches": num_batches,
    }


def _fit_times(times, num_stages, default, label):
    """Return a list of exactly ``num_stages`` per-stage times.

    ``None`` becomes ``[default] * num_stages``. Too-short lists are padded
    by repeating their last value; too-long lists are truncated. A warning
    is printed on any length mismatch.
    """
    if times is None:
        return [default] * num_stages
    if len(times) != num_stages:
        print(
            f"Warning: {label} length ({len(times)}) doesn't match num_stages ({num_stages})"
        )
        if len(times) < num_stages:
            # Extend with repeats of the last value.
            times = list(times) + [times[-1]] * (num_stages - len(times))
        else:
            # Truncate.
            times = times[:num_stages]
        print(f"Adjusted {label}: {times}")
    return list(times)


def _render_schedule(schedule, visualizer, output_file):
    """Dispatch the schedule to the selected visualization backend."""
    if visualizer == "matplotlib" or not DASH_AVAILABLE:
        if not DASH_AVAILABLE and visualizer in ["dash", "dash-interactive"]:
            print("Warning: Dash not available. Falling back to matplotlib.")
        visualize_pipeline_parallelism(
            schedule=schedule, schedule_type="1f1b", output_file=output_file
        )
    elif visualizer == "dash":
        # Use a distinct suffix so the plotly export doesn't clobber the
        # matplotlib output file.
        output_base = os.path.splitext(output_file)[0]
        save_pipeline_visualization_plotly(
            schedule=schedule,
            schedule_type="1f1b",
            output_file=f"{output_base}_plotly.png",
        )
    elif visualizer == "dash-interactive":
        print("Using Dash interactive visualization")
        visualize_pipeline_parallelism_dash(
            schedule=schedule, schedule_type="1f1b", port=8050, debug=False
        )
| 488 |
-
|
| 489 |
-
|
| 490 |
-
# Script entry point: run the full CLI workflow.
if __name__ == "__main__":
    main()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
pipeline_1f1b.png
DELETED
Git LFS Details
|
pyproject.toml
ADDED
|
@@ -0,0 +1,67 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[build-system]
|
| 2 |
+
requires = ["hatchling"]
|
| 3 |
+
build-backend = "hatchling.build"
|
| 4 |
+
|
| 5 |
+
[project]
|
| 6 |
+
name = "pp-emulation"
|
| 7 |
+
version = "0.1.0"
|
| 8 |
+
description = "Pipeline Parallelism Emulation and Visualization"
|
| 9 |
+
readme = "README.md"
|
| 10 |
+
requires-python = ">=3.10"
|
| 11 |
+
authors = [
|
| 12 |
+
{name = "Project Author"}
|
| 13 |
+
]
|
| 14 |
+
classifiers = [
|
| 15 |
+
"Programming Language :: Python :: 3",
|
| 16 |
+
"License :: OSI Approved :: MIT License",
|
| 17 |
+
"Operating System :: OS Independent",
|
| 18 |
+
]
|
| 19 |
+
dependencies = [
|
| 20 |
+
"dash>=2.14.0",
|
| 21 |
+
"hydra-core>=1.3.2",
|
| 22 |
+
"omegaconf>=2.3.0",
|
| 23 |
+
"plotly>=5.18.0",
|
| 24 |
+
"pandas>=2.1.0",
|
| 25 |
+
"numpy>=1.26.0",
|
| 26 |
+
"tqdm>=4.67.0",
|
| 27 |
+
]
|
| 28 |
+
|
| 29 |
+
[project.optional-dependencies]
|
| 30 |
+
dev = [
|
| 31 |
+
"pytest>=7.4.0",
|
| 32 |
+
"black>=23.7.0",
|
| 33 |
+
"isort>=5.12.0",
|
| 34 |
+
"mypy>=1.5.1",
|
| 35 |
+
]
|
| 36 |
+
|
| 37 |
+
# Add Hatch configuration to explicitly define where source code is located
|
| 38 |
+
[tool.hatch.build.targets.wheel]
|
| 39 |
+
packages = ["src"]
|
| 40 |
+
|
| 41 |
+
[tool.hatch.build.targets.sdist]
|
| 42 |
+
include = [
|
| 43 |
+
"src",
|
| 44 |
+
"main.py",
|
| 45 |
+
"conf",
|
| 46 |
+
"LICENSE",
|
| 47 |
+
"README.md",
|
| 48 |
+
]
|
| 49 |
+
|
| 50 |
+
[tool.black]
|
| 51 |
+
line-length = 88
|
| 52 |
+
target-version = ["py310"]
|
| 53 |
+
|
| 54 |
+
[tool.isort]
|
| 55 |
+
profile = "black"
|
| 56 |
+
line_length = 88
|
| 57 |
+
|
| 58 |
+
[tool.mypy]
|
| 59 |
+
python_version = "3.10"
|
| 60 |
+
warn_return_any = true
|
| 61 |
+
warn_unused_configs = true
|
| 62 |
+
disallow_untyped_defs = true
|
| 63 |
+
disallow_incomplete_defs = true
|
| 64 |
+
|
| 65 |
+
[tool.pytest.ini_options]
|
| 66 |
+
testpaths = ["tests"]
|
| 67 |
+
pythonpath = ["."]
|
requirements-dash.txt
DELETED
|
@@ -1,5 +0,0 @@
|
|
| 1 |
-
dash==2.13.0
|
| 2 |
-
plotly==5.18.0
|
| 3 |
-
numpy
|
| 4 |
-
kaleido # For static image export
|
| 5 |
-
tqdm # For progress bars
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
src/__init__.py
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Pipeline Parallelism Emulation and Visualization package."""
|
| 2 |
+
|
| 3 |
+
__version__ = "0.1.0"
|
src/execution_model.py
ADDED
|
@@ -0,0 +1,219 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from collections import defaultdict
|
| 2 |
+
from typing import Dict, List, Optional, Union
|
| 3 |
+
|
| 4 |
+
|
| 5 |
+
class Operation:
    """A single forward or backward pass of one micro-batch at one stage."""

    def __init__(self, batch_id: int, stage_id: int, op_type: str):
        # Identity of this unit of work.
        self.batch_id = batch_id
        self.stage_id = stage_id
        self.op_type = op_type
        # Set once the op is assigned to a DeviceQueue.
        self.device_id = None
        # Set by the executor once the op is placed on the timeline.
        self.start_time = None
        self.end_time = None
| 16 |
+
|
| 17 |
+
|
| 18 |
+
class DeviceQueue:
    """Ordered queue of operations assigned to one device.

    A device may host several pipeline stages (e.g. with interleaved
    placement); ``stages`` lists the stage ids it owns.
    """

    def __init__(self, stages: List[int], device_id: int):
        self.stages = stages
        self.device_id = device_id
        self.ops = []  # List of operations, in execution order

    def add_operation(self, op: Operation):
        """Append ``op`` to this device's queue and claim ownership of it.

        Raises:
            AssertionError: if the op's stage is not hosted on this device,
                or the op is already owned by another device.
        """
        # Validate BOTH preconditions before mutating anything, so a failed
        # assertion cannot leave the op half-registered in the queue
        # (previously the op was appended before the ownership check).
        assert op.stage_id in self.stages
        assert op.device_id is None
        self.ops.append(op)
        op.device_id = self.device_id
| 29 |
+
|
| 30 |
+
|
| 31 |
+
class ScheduleConfig:
    """Static description of a pipeline-parallel run.

    Holds device/stage/batch counts, communication latency, stage placement,
    and per-operation durations. A duration may be a single float (uniform
    across stages) or a dict mapping stage_id -> time.
    """

    def __init__(
        self,
        num_devices: int,
        num_stages: int,
        num_batches: int,
        p2p_latency: float = 0.0,
        placement_strategy: str = "normal",
        op_times: Optional[Dict[str, Union[float, Dict[int, float]]]] = None,
    ):
        self.num_devices = num_devices
        self.num_stages = num_stages
        self.num_batches = num_batches
        self.p2p_latency = p2p_latency
        self.placement_strategy = placement_strategy

        # Default operation times; a plain float applies to every stage.
        self.op_times = {
            "forward": 1.0,
            "backward": 2.0,
        }

        # Merge user-provided operation times over the defaults.
        if op_times:
            for op_type, times in op_times.items():
                if isinstance(times, dict):
                    # Stage-specific overrides: ensure the stored entry is a
                    # dict, seeding it from the uniform default if needed.
                    # (The redundant re-conversion that previously ran inside
                    # the per-stage loop was dead code and has been removed.)
                    current = self.op_times.get(op_type)
                    if not isinstance(current, dict):
                        current = (
                            {}
                            if current is None
                            else {i: current for i in range(num_stages)}
                        )
                        self.op_times[op_type] = current
                    current.update(times)
                else:
                    # A float replaces the time uniformly for all stages.
                    self.op_times[op_type] = times

        assert num_stages % num_devices == 0, "num_stages must be divisible by num_devices"
        self.num_stages_per_device = num_stages // num_devices

        self.init_device_to_stages()
        # Every stage must be placed exactly once.
        assert (
            sum(len(stages) for stages in self.device_to_stages.values()) == num_stages
        )

    def init_device_to_stages(self):
        """Compute ``self.device_to_stages`` from the placement strategy."""
        if self.placement_strategy == "normal":
            # Contiguous blocks of stages per device: device 0 gets the first
            # num_stages/num_devices stages, and so on.
            stages_per_device = self.num_stages // self.num_devices
            self.device_to_stages = defaultdict(list)
            for i in range(self.num_stages):
                self.device_to_stages[i // stages_per_device].append(i)
        elif self.placement_strategy == "interleave":
            # Round-robin: stage i lives on device i % num_devices.
            self.device_to_stages = defaultdict(list)
            for i in range(self.num_stages):
                self.device_to_stages[i % self.num_devices].append(i)
        else:
            raise ValueError(f"Invalid placement strategy: {self.placement_strategy}")

    def get_op_time(self, op_type: str, stage_id: int):
        """Return the duration of ``op_type`` at ``stage_id``.

        Raises:
            ValueError: for an unknown op type, or a stage with no time when
                stage-specific times are in use.
        """
        if op_type not in self.op_times:
            raise ValueError(f"Invalid operation type: {op_type}")

        times = self.op_times[op_type]
        if isinstance(times, dict):
            if stage_id not in times:
                raise ValueError(f"No time specified for operation {op_type} at stage {stage_id}")
            return times[stage_id]
        # Uniform time for all stages.
        return times
| 110 |
+
|
| 111 |
+
|
| 112 |
+
class Schedule:
    """All operations of a run plus their per-device execution queues.

    One Operation exists per (batch, stage, op_type) triple. The per-device
    execution ORDER is not decided here — a scheduling strategy fills the
    DeviceQueues; the executor later assigns wall-clock times.
    """

    def __init__(self, config: ScheduleConfig):
        self.ops = {}  # (batch_id, stage_id, op_type) -> Operation
        # One queue per device, seeded with the stages that device hosts.
        self.dev_queues: List[DeviceQueue] = []
        for dev_id in range(config.num_devices):
            self.dev_queues.append(DeviceQueue(config.device_to_stages[dev_id], dev_id))
        self.config = config

        self.init_operations()

    def init_operations(self, op_types: Optional[List[str]] = None):
        """Create one Operation per (batch, stage, op_type) combination."""
        if op_types is None:
            op_types = ["forward", "backward"]
        for batch_id in range(self.config.num_batches):
            for stage_id in range(self.config.num_stages):
                for op_type in op_types:
                    self.ops[(batch_id, stage_id, op_type)] = Operation(
                        batch_id, stage_id, op_type
                    )

    def get_op(self, batch_id: int, stage_id: int, op_type: str):
        """Look up the Operation for the given (batch, stage, op_type)."""
        return self.ops[(batch_id, stage_id, op_type)]

    def get_dependencies(self, op: Operation):
        """Return ``op``'s predecessors as (dep_op, gap) pairs.

        Data deps: a forward waits on the previous stage's forward; a backward
        waits on the next stage's backward — both with p2p latency as the gap.
        Execution dep: the previous op queued on the same device (zero gap).
        """
        deps = []
        if op.op_type == "forward":
            if op.stage_id > 0:
                deps.append(
                    (
                        self.get_op(op.batch_id, op.stage_id - 1, "forward"),
                        self.config.p2p_latency,
                    )
                )
        elif op.op_type == "backward":
            if op.stage_id < self.config.num_stages - 1:
                deps.append(
                    (
                        self.get_op(op.batch_id, op.stage_id + 1, "backward"),
                        self.config.p2p_latency,
                    )
                )

        # NOTE(review): list.index is a linear scan per call — O(n^2) across
        # a full schedule; acceptable for emulation-sized inputs.
        device_index = self.dev_queues[op.device_id].ops.index(op)
        if device_index > 0:
            deps.append((self.dev_queues[op.device_id].ops[device_index - 1], 0.0))
        return deps

    def show(self):
        """Display detailed information about the schedule for debugging purposes."""
        print("\n=== SCHEDULE DETAILS ===")
        print(f"Devices: {self.config.num_devices}, Stages: {self.config.num_stages}, Batches: {self.config.num_batches}")
        print(f"Placement Strategy: {self.config.placement_strategy}")
        print("\n=== DEVICE QUEUES ===")

        for dev_id in range(self.config.num_devices):
            print(f"\nDEVICE {dev_id} (Stages: {self.dev_queues[dev_id].stages}):")
            print("-" * 80)
            print(f"{'Batch':^6} | {'Stage':^6} | {'Type':^10} | {'Start':^10} | {'End':^10} | {'Duration':^10}")
            print("-" * 80)

            for op in self.dev_queues[dev_id].ops:
                op_type = "Forward" if op.op_type == "forward" else "Backward"
                # Timing fields are None until the executor has run.
                start = f"{op.start_time:.2f}" if op.start_time is not None else "N/A"
                end = f"{op.end_time:.2f}" if op.end_time is not None else "N/A"

                duration = "N/A"
                if op.start_time is not None and op.end_time is not None:
                    duration = f"{op.end_time - op.start_time:.2f}"

                print(f"{op.batch_id:^6} | {op.stage_id:^6} | {op_type:^10} | {start:^10} | {end:^10} | {duration:^10}")

        # Find the total execution time (if timing info is available)
        if all(op.end_time is not None for op in self.ops.values()):
            total_time = max(op.end_time for op in self.ops.values())
            print(f"\nTotal execution time: {total_time:.2f}")
| 187 |
+
|
| 188 |
+
|
| 189 |
+
class ScheduleExecutor:
    """Assigns start/end times to every operation in a Schedule.

    Times are resolved recursively from each op's dependencies, producing the
    simulated timeline of the pipeline.
    """

    def __init__(self, schedule: Schedule):
        self.schedule = schedule

    def execute(self):
        """Resolve ``start_time``/``end_time`` for every operation."""
        def execute_op(op: Operation):
            # Time an op after (recursively) timing any untimed dependency.
            deps = self.schedule.get_dependencies(op)
            if len(deps) == 0:
                op.start_time = 0.0
            else:
                for dep, gap in deps:
                    if dep.end_time is None or dep.start_time is None:
                        execute_op(dep)
                # Start when the latest dependency (plus its comm gap) ends.
                op.start_time = max(dep.end_time + gap for dep, gap in deps)
            op.end_time = op.start_time + self.schedule.config.get_op_time(
                op.op_type, op.stage_id
            )

        # Walk all device queues in lockstep so most dependencies are already
        # resolved when an op is visited.
        op_num = len(self.schedule.dev_queues[0].ops)
        for i in range(op_num):
            for dev_id in range(self.schedule.config.num_devices):
                op = self.schedule.dev_queues[dev_id].ops[i]
                execute_op(op)

        # Sanity check: every operation must have been placed on the timeline.
        for op in self.schedule.ops.values():
            assert (
                op.start_time is not None
            ), f"op {op.batch_id}, {op.stage_id}, {op.op_type} has no start time"
            assert (
                op.end_time is not None
            ), f"op {op.batch_id}, {op.stage_id}, {op.op_type} has no end time"
src/strategies.py
ADDED
|
@@ -0,0 +1,192 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from collections import defaultdict
|
| 2 |
+
from src.execution_model import Schedule, ScheduleConfig
|
| 3 |
+
|
| 4 |
+
|
| 5 |
+
def generate_1f1b_schedule(config: ScheduleConfig):
    """Build a classic 1F1B (one-forward-one-backward) pipeline schedule.

    Each device runs a warmup of forwards, then alternates forward/backward
    in steady state, and finally drains the remaining backwards (cooldown).
    """
    schedule = Schedule(config)

    for dev_id in range(config.num_devices):
        queue = schedule.dev_queues[dev_id]
        stages_fwd = list(queue.stages)
        stages_bwd = list(reversed(queue.stages))

        fwd_batch = 0
        bwd_batch = 0
        # Later devices need fewer warmup forwards before 1F1B kicks in.
        warmup = config.num_devices - dev_id - 1
        cooldown = warmup
        steady = config.num_batches - warmup

        # Warmup phase: forwards only.
        for _ in range(warmup):
            for stage in stages_fwd:
                queue.add_operation(schedule.get_op(fwd_batch, stage, "forward"))
            fwd_batch += 1

        # Steady state: one forward followed by one backward per micro-batch.
        for _ in range(steady):
            for stage in stages_fwd:
                queue.add_operation(schedule.get_op(fwd_batch, stage, "forward"))
            fwd_batch += 1
            for stage in stages_bwd:
                queue.add_operation(schedule.get_op(bwd_batch, stage, "backward"))
            bwd_batch += 1

        # Cooldown phase: drain the remaining backwards.
        for _ in range(cooldown):
            for stage in stages_bwd:
                queue.add_operation(schedule.get_op(bwd_batch, stage, "backward"))
            bwd_batch += 1

    return schedule
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
# Some of this code is copied from Megatron-LM.
def generate_1f1b_interleave_schedule(config: ScheduleConfig):
    """Generate an interleaved (virtual-pipeline) 1F1B schedule.

    Mirrors Megatron-LM's forward_backward_pipelining_with_interleaving: each
    device hosts ``config.num_stages_per_device`` model chunks, and the
    per-device operation order is derived from a schedule lookup table.
    """
    schedule = Schedule(config)

    def get_pp_rank_microbatches(
        num_microbatches,
        num_devices,
        device_id,
        num_stages_per_device,
        microbatch_group_size_per_vp_stage,
    ):
        """Get the number of total, warmup, and remaining microbatches in PP scheduling."""
        total_num_microbatches = num_microbatches * num_stages_per_device
        are_all_microbatches_in_warmup = False

        if num_devices > 1:
            if num_stages_per_device is None:
                # forward_backward_pipelining_without_interleaving
                num_warmup_microbatches = num_devices - device_id - 1
            else:
                # forward_backward_pipelining_with_interleaving
                # Run (num_model_chunks-1)*microbatch_group_size_per_vp_stage on
                # all workers, followed by more microbatches after depending on
                # stage ID (more forward passes for earlier stages, later stages can
                # immediately start with 1F1B).
                num_warmup_microbatches = (num_devices - device_id - 1) * 2
                num_warmup_microbatches += (num_stages_per_device - 1) * microbatch_group_size_per_vp_stage
        else:
            # forward_backward_no_pipelining
            num_warmup_microbatches = 1

        # Clamp: with few microbatches, everything fits in warmup.
        if num_warmup_microbatches >= total_num_microbatches:
            num_warmup_microbatches = total_num_microbatches
            are_all_microbatches_in_warmup = True
        num_microbatches_remaining = total_num_microbatches - num_warmup_microbatches

        return (
            total_num_microbatches,
            are_all_microbatches_in_warmup,
            num_warmup_microbatches,
            num_microbatches_remaining,
        )

    def get_schedule_table(num_microbatches, num_model_chunks, microbatch_group_size_per_vp_stage):
        """Get the schedule table for PP scheduling.

        Create a tunable schedule lookup table.
        The schedule lookup table uses the virtual_microbatch_id to find the corresponding microbatch_id and model_chunk_id.
        For example, the tunable schedule table for PP2 N3M5 with VP2 is constructed as below:
        virtual_microbatch_id | 0 1 2 3 4 5 6 7 8 9
        microbatch_id         | 0 1 2 0 1 2 3 4 3 4
        model_chunk_id        | 0 0 0 1 1 1 0 0 1 1
        """
        schedule_table = []
        for min_microbatch_id_in_group in range(
            0, num_microbatches, microbatch_group_size_per_vp_stage
        ):
            if min_microbatch_id_in_group + microbatch_group_size_per_vp_stage >= num_microbatches:
                # Construct schedule for the last microbatch group
                schedule_table.extend(
                    [
                        (microbatch_id, model_chunk_id)
                        for model_chunk_id in range(num_model_chunks)
                        for microbatch_id in range(min_microbatch_id_in_group, num_microbatches)
                    ]
                )
            else:
                # Construct schedule for other microbatch groups
                schedule_table.extend(
                    [
                        (microbatch_id, model_chunk_id)
                        for model_chunk_id in range(num_model_chunks)
                        for microbatch_id in range(
                            min_microbatch_id_in_group,
                            min_microbatch_id_in_group + microbatch_group_size_per_vp_stage,
                        )
                    ]
                )
        return schedule_table

    def convert_schedule_table_to_order(num_warmup_microbatches, num_model_chunks, schedule_table):
        """Convert a tunable schedule lookup table to the te.make_graphed_callables() accepted
        order format. For example, the tunable schedule table for PP2 N3M5 with VP2 is as below:
        virtual_microbatch_id | 0 1 2 3 4 5 6 7 8 9
        microbatch_id         | 0 1 2 0 1 2 3 4 3 4
        model_chunk_id        | 0 0 0 1 1 1 0 0 1 1

        Then the forward backward separated order is:
        forward               | 1 1 1 2 2 2 1 1 2 2
        backward              | -2 -2 -2 -1 -1 -1 -2 -2 -1 -1

        If num_warmup_microbatches is 5, the output order is:
        1 1 1 2 2 2 -2 1 -2 1 -2 2 -1 2 -1 -1 -2 -2 -1 -1
        """
        _, model_chunk_id_table = zip(*schedule_table)
        # Positive entries encode forward of chunk k as +k (1-based);
        # negative entries encode backward of chunk k as k - num_model_chunks.
        forward_order = [chunk_id + 1 for chunk_id in model_chunk_id_table]
        backward_order = [chunk_id - num_model_chunks for chunk_id in model_chunk_id_table]
        order = forward_order[:num_warmup_microbatches]
        # Steady state: interleave one forward with one (lagged) backward.
        for i in range(num_warmup_microbatches, len(forward_order)):
            order.append(forward_order[i])
            order.append(backward_order[i - num_warmup_microbatches])
        # Cooldown: the backwards deferred during warmup.
        if num_warmup_microbatches > 0:
            order.extend(backward_order[-num_warmup_microbatches:])
        return order

    for device_id in range(config.num_devices):
        microbatch_group_size_per_vp_stage = config.num_devices
        total_num_microbatches, are_all_microbatches_in_warmup, num_warmup_microbatches, num_microbatches_remaining = get_pp_rank_microbatches(
            config.num_batches,
            config.num_devices,
            device_id,
            config.num_stages_per_device,
            microbatch_group_size_per_vp_stage,
        )

        schedule_table = get_schedule_table(
            config.num_batches,
            config.num_stages_per_device,
            microbatch_group_size_per_vp_stage,
        )

        order = convert_schedule_table_to_order(
            num_warmup_microbatches,
            num_model_chunks=config.num_stages_per_device,
            schedule_table=schedule_table,
        )

        # Next microbatch id per chunk: key +i tracks forwards of chunk i,
        # key -i tracks backwards of chunk i.
        cur_stage_microbatch_id = {}
        for i in range(1, config.num_stages_per_device+1):
            cur_stage_microbatch_id[i] = 0
            cur_stage_microbatch_id[-i] = 0
        for order_item in order:
            # Map the order entry's chunk index to the actual stage id hosted
            # on this device.
            stage_id = schedule.dev_queues[device_id].stages[abs(order_item)-1]

            if order_item > 0:
                op_type = "forward"
                micro_batch_id = cur_stage_microbatch_id[order_item]
                cur_stage_microbatch_id[order_item] = cur_stage_microbatch_id[order_item] + 1
            elif order_item < 0:
                op_type = "backward"
                micro_batch_id = cur_stage_microbatch_id[order_item]
                cur_stage_microbatch_id[order_item] = cur_stage_microbatch_id[order_item] + 1
            else:
                raise ValueError(f"Invalid order item: {order_item}")
            schedule.dev_queues[device_id].add_operation(
                schedule.get_op(micro_batch_id, stage_id, op_type)
            )
    return schedule
|
dash_visualizer.py → src/visualizer.py
RENAMED
|
@@ -1,41 +1,86 @@
|
|
| 1 |
import dash
|
| 2 |
from dash import dcc, html
|
| 3 |
-
from dash.dependencies import Input, Output
|
| 4 |
import plotly.graph_objects as go
|
| 5 |
-
import
|
| 6 |
-
from typing import List, Dict, Literal
|
| 7 |
from tqdm import tqdm
|
| 8 |
-
import
|
| 9 |
|
|
|
|
| 10 |
|
| 11 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 12 |
"""
|
| 13 |
Create a Plotly figure for pipeline parallelism scheduling.
|
| 14 |
|
| 15 |
Args:
|
| 16 |
-
|
| 17 |
-
Each task is a dictionary with keys:
|
| 18 |
-
- 'type': 'forward', 'backward', or 'optimizer'
|
| 19 |
-
- 'batch': batch number
|
| 20 |
-
- 'start_time': start time of the task
|
| 21 |
-
- 'duration': duration of the task
|
| 22 |
max_time: Optional maximum time to display
|
| 23 |
show_progress: Whether to show a progress bar
|
| 24 |
"""
|
| 25 |
-
#
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
optimizer_color = "#FFEFCF"
|
| 29 |
empty_color = "whitesmoke"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 30 |
|
| 31 |
-
|
| 32 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 33 |
|
| 34 |
# Find the maximum time in the schedule if not provided
|
| 35 |
if max_time is None:
|
| 36 |
max_time = 0
|
| 37 |
-
for device in
|
| 38 |
-
for task in
|
| 39 |
end_time = task["start_time"] + task["duration"]
|
| 40 |
if end_time > max_time:
|
| 41 |
max_time = end_time
|
|
@@ -44,56 +89,51 @@ def create_pipeline_figure(schedule: Dict[int, List[Dict]], max_time=None, show_
|
|
| 44 |
fig = go.Figure()
|
| 45 |
|
| 46 |
# Initialize progress tracking
|
| 47 |
-
total_tasks = sum(len(tasks) for tasks in
|
| 48 |
tasks_processed = 0
|
| 49 |
|
| 50 |
if show_progress:
|
| 51 |
-
progress_bar = tqdm(total=total_tasks +
|
| 52 |
|
| 53 |
-
#
|
| 54 |
-
for
|
| 55 |
-
device_idx_reversed = num_stages - device_idx - 1 # Reverse for plotting
|
| 56 |
-
fig.add_trace(go.Scatter(
|
| 57 |
-
x=[0, max_time],
|
| 58 |
-
y=[device_idx_reversed, device_idx_reversed],
|
| 59 |
-
mode='lines',
|
| 60 |
-
line=dict(color='lightgray', width=0.5),
|
| 61 |
-
showlegend=False,
|
| 62 |
-
hoverinfo='none'
|
| 63 |
-
))
|
| 64 |
-
if show_progress:
|
| 65 |
-
progress_bar.update(1)
|
| 66 |
|
| 67 |
# Add rectangles for each task
|
| 68 |
-
for device_idx, device in enumerate(
|
| 69 |
-
device_idx_reversed =
|
|
|
|
|
|
|
|
|
|
| 70 |
|
| 71 |
-
for task in
|
| 72 |
# Determine task color and text color
|
| 73 |
if task["type"] == "forward":
|
| 74 |
-
color =
|
| 75 |
text_color = "white"
|
| 76 |
name = "Forward"
|
| 77 |
elif task["type"] == "backward":
|
| 78 |
-
color =
|
| 79 |
text_color = "black"
|
| 80 |
name = "Backward"
|
| 81 |
-
else:
|
| 82 |
-
color =
|
| 83 |
text_color = "black"
|
| 84 |
-
name = "
|
| 85 |
-
|
| 86 |
# Add rectangle for the task
|
| 87 |
start_time = task["start_time"]
|
| 88 |
duration = task["duration"]
|
| 89 |
|
|
|
|
|
|
|
|
|
|
| 90 |
# Create rectangle using shape
|
| 91 |
fig.add_shape(
|
| 92 |
type="rect",
|
| 93 |
x0=start_time,
|
| 94 |
-
y0=
|
| 95 |
x1=start_time + duration,
|
| 96 |
-
y1=
|
| 97 |
line=dict(color="black", width=0.5),
|
| 98 |
fillcolor=color,
|
| 99 |
layer="above",
|
|
@@ -102,12 +142,23 @@ def create_pipeline_figure(schedule: Dict[int, List[Dict]], max_time=None, show_
|
|
| 102 |
# Add batch number text
|
| 103 |
fig.add_annotation(
|
| 104 |
x=start_time + duration / 2,
|
| 105 |
-
y=
|
| 106 |
-
text=
|
| 107 |
showarrow=False,
|
| 108 |
-
font=dict(color=text_color, size=
|
| 109 |
)
|
| 110 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 111 |
# Update progress
|
| 112 |
if show_progress:
|
| 113 |
tasks_processed += 1
|
|
@@ -115,9 +166,8 @@ def create_pipeline_figure(schedule: Dict[int, List[Dict]], max_time=None, show_
|
|
| 115 |
|
| 116 |
# Add custom legend
|
| 117 |
legend_items = [
|
| 118 |
-
dict(name="Forward", color=
|
| 119 |
-
dict(name="Backward", color=
|
| 120 |
-
dict(name="Optimizer step", color=optimizer_color)
|
| 121 |
]
|
| 122 |
|
| 123 |
for i, item in enumerate(legend_items):
|
|
@@ -133,77 +183,98 @@ def create_pipeline_figure(schedule: Dict[int, List[Dict]], max_time=None, show_
|
|
| 133 |
progress_bar.update(1)
|
| 134 |
|
| 135 |
# Set axis properties
|
| 136 |
-
device_labels = [f"Device {i
|
| 137 |
-
device_labels.reverse() # Reverse to put Device
|
|
|
|
|
|
|
|
|
|
| 138 |
|
|
|
|
|
|
|
|
|
|
| 139 |
fig.update_layout(
|
| 140 |
-
xaxis=dict(
|
| 141 |
-
showticklabels=False,
|
| 142 |
-
showgrid=False,
|
| 143 |
-
zeroline=False,
|
| 144 |
-
title="Time →",
|
| 145 |
-
range=[0, max_time + 0.5]
|
| 146 |
-
),
|
| 147 |
yaxis=dict(
|
| 148 |
tickmode="array",
|
| 149 |
-
tickvals=
|
| 150 |
ticktext=device_labels,
|
| 151 |
showgrid=False,
|
| 152 |
zeroline=False,
|
| 153 |
-
range=[-0.5, num_stages - 0.5]
|
| 154 |
),
|
| 155 |
-
margin=dict(l=50, r=
|
| 156 |
plot_bgcolor="white",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 157 |
legend=dict(
|
| 158 |
orientation="h",
|
| 159 |
-
yanchor="
|
| 160 |
-
y=-0.
|
| 161 |
xanchor="center",
|
| 162 |
x=0.5
|
| 163 |
-
)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 164 |
)
|
| 165 |
|
| 166 |
if show_progress:
|
| 167 |
-
progress_bar.update(1)
|
| 168 |
progress_bar.close()
|
| 169 |
|
| 170 |
return fig
|
| 171 |
|
| 172 |
|
| 173 |
-
def create_dash_app(schedule:
|
| 174 |
"""
|
| 175 |
-
Create a Dash app
|
| 176 |
-
|
| 177 |
Args:
|
| 178 |
-
schedule:
|
| 179 |
-
schedule_type: Type of
|
| 180 |
"""
|
| 181 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 182 |
|
| 183 |
app.layout = html.Div([
|
| 184 |
-
html.H1(f"Pipeline Parallelism
|
| 185 |
-
style={'textAlign': 'center'}),
|
| 186 |
-
|
| 187 |
-
html.Div(id="loading-container", children=[
|
| 188 |
-
dcc.Loading(
|
| 189 |
-
id="loading-graph",
|
| 190 |
-
type="circle",
|
| 191 |
-
children=[
|
| 192 |
-
html.Div(id="graph-container", children=[
|
| 193 |
-
dcc.Graph(
|
| 194 |
-
id='pipeline-graph',
|
| 195 |
-
style={'height': '600px'}
|
| 196 |
-
)
|
| 197 |
-
])
|
| 198 |
-
]
|
| 199 |
-
)
|
| 200 |
-
]),
|
| 201 |
|
| 202 |
html.Div([
|
| 203 |
-
html.
|
| 204 |
-
|
| 205 |
-
|
| 206 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 207 |
])
|
| 208 |
|
| 209 |
@app.callback(
|
|
@@ -213,98 +284,65 @@ def create_dash_app(schedule: Dict[int, List[Dict]], schedule_type="1f1b"):
|
|
| 213 |
)
|
| 214 |
def load_graph(_):
|
| 215 |
# Create the figure when the app loads
|
| 216 |
-
return create_pipeline_figure(
|
| 217 |
-
|
| 218 |
@app.callback(
|
| 219 |
Output("download-image", "data"),
|
| 220 |
Input("btn-download", "n_clicks"),
|
| 221 |
prevent_initial_call=True,
|
| 222 |
)
|
| 223 |
def download_image(n_clicks):
|
| 224 |
-
#
|
| 225 |
-
fig = create_pipeline_figure(
|
| 226 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 227 |
return dict(
|
| 228 |
-
content=
|
| 229 |
-
filename="
|
|
|
|
|
|
|
| 230 |
)
|
| 231 |
|
| 232 |
return app
|
| 233 |
|
| 234 |
|
| 235 |
def visualize_pipeline_parallelism_dash(
|
| 236 |
-
schedule:
|
| 237 |
-
schedule_type: Literal["simple", "1f1b"] = "1f1b",
|
| 238 |
port: int = 8050,
|
| 239 |
debug: bool = False
|
| 240 |
):
|
| 241 |
"""
|
| 242 |
-
|
| 243 |
-
|
| 244 |
Args:
|
| 245 |
-
schedule:
|
| 246 |
-
|
| 247 |
-
|
| 248 |
-
debug: Whether to run the app in debug mode
|
| 249 |
"""
|
| 250 |
-
app = create_dash_app(schedule
|
| 251 |
print(f"Starting Dash app on http://localhost:{port}/")
|
| 252 |
app.run_server(debug=debug, port=port)
|
| 253 |
|
| 254 |
|
| 255 |
def save_pipeline_visualization_plotly(
|
| 256 |
-
schedule:
|
| 257 |
-
schedule_type: Literal["simple", "1f1b"] = "1f1b",
|
| 258 |
output_file: str = "pipeline_visualization_plotly.png",
|
| 259 |
):
|
| 260 |
"""
|
| 261 |
-
Save a static
|
| 262 |
-
|
| 263 |
Args:
|
| 264 |
-
schedule:
|
| 265 |
-
|
| 266 |
-
output_file: Path to save the visualization
|
| 267 |
"""
|
| 268 |
-
|
| 269 |
-
fig = create_pipeline_figure(
|
| 270 |
-
|
| 271 |
-
# Update layout for static image
|
| 272 |
-
fig.update_layout(
|
| 273 |
-
title=f"Pipeline Parallelism Visualization ({schedule_type.upper()})",
|
| 274 |
-
title_x=0.5
|
| 275 |
-
)
|
| 276 |
|
| 277 |
-
print(f"Saving
|
| 278 |
-
|
| 279 |
-
fig.write_image(output_file, scale=3)
|
| 280 |
print(f"Visualization saved to {output_file}")
|
| 281 |
|
| 282 |
-
|
| 283 |
-
if __name__ == "__main__":
|
| 284 |
-
# Example usage
|
| 285 |
-
import argparse
|
| 286 |
-
from pipeline import create_1f1b_schedule
|
| 287 |
-
|
| 288 |
-
parser = argparse.ArgumentParser(description="Pipeline Parallelism Visualizer")
|
| 289 |
-
parser.add_argument("--num-stages", type=int, default=4, help="Number of pipeline stages")
|
| 290 |
-
parser.add_argument("--num-batches", type=int, default=8, help="Number of microbatches")
|
| 291 |
-
parser.add_argument("--interactive", action="store_true", help="Run interactive Dash app")
|
| 292 |
-
parser.add_argument("--port", type=int, default=8050, help="Port for Dash app")
|
| 293 |
-
parser.add_argument("--output", type=str, default="pipeline_visualization_plotly.png", help="Output file for static image")
|
| 294 |
-
args = parser.parse_args()
|
| 295 |
-
|
| 296 |
-
# Create an example schedule
|
| 297 |
-
forward_times = [1.0] * args.num_stages
|
| 298 |
-
backward_times = [2.0] * args.num_stages
|
| 299 |
-
|
| 300 |
-
schedule = create_1f1b_schedule(
|
| 301 |
-
num_stages=args.num_stages,
|
| 302 |
-
num_batches=args.num_batches,
|
| 303 |
-
forward_times=forward_times,
|
| 304 |
-
backward_times=backward_times,
|
| 305 |
-
)
|
| 306 |
-
|
| 307 |
-
if args.interactive:
|
| 308 |
-
visualize_pipeline_parallelism_dash(schedule, port=args.port)
|
| 309 |
-
else:
|
| 310 |
-
save_pipeline_visualization_plotly(schedule, output_file=args.output)
|
|
|
|
| 1 |
import dash
|
| 2 |
from dash import dcc, html
|
| 3 |
+
from dash.dependencies import Input, Output
|
| 4 |
import plotly.graph_objects as go
|
| 5 |
+
import argparse
|
| 6 |
+
from typing import List, Dict, Literal, Optional
|
| 7 |
from tqdm import tqdm
|
| 8 |
+
import base64
|
| 9 |
|
| 10 |
+
from src.execution_model import Schedule
|
| 11 |
|
| 12 |
+
|
| 13 |
+
def convert_schedule_to_visualization_format(schedule: Schedule):
    """Convert a Schedule object into the structure the plotting code expects.

    Returns:
        Dict[int, List[Dict]]: mapping of device_id to a list of task dicts,
        each with "type", "batch", "stage", "start_time" and "duration" keys.

    Raises:
        ValueError: if any operation has no timing information yet.
    """
    # Timing must already have been filled in by ScheduleExecutor.execute().
    if any(op.start_time is None or op.end_time is None for op in schedule.ops.values()):
        raise ValueError("Operations must have start and end times. Run ScheduleExecutor.execute() first.")

    # One entry per device, in device-queue order.
    return {
        device_id: [
            {
                "type": op.op_type,
                "batch": op.batch_id + 1,  # batch_id is 0-indexed; display is 1-indexed
                "stage": op.stage_id,
                "start_time": op.start_time,
                "duration": op.end_time - op.start_time,
            }
            for op in device_queue.ops
        ]
        for device_id, device_queue in enumerate(schedule.dev_queues)
    }
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
def create_pipeline_figure(schedule_data: Dict[int, List[Dict]], max_time=None, show_progress=True):
|
| 44 |
"""
|
| 45 |
Create a Plotly figure for pipeline parallelism scheduling.
|
| 46 |
|
| 47 |
Args:
|
| 48 |
+
schedule_data: Dictionary mapping device IDs to lists of tasks (converted from Schedule)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 49 |
max_time: Optional maximum time to display
|
| 50 |
show_progress: Whether to show a progress bar
|
| 51 |
"""
|
| 52 |
+
# Find the number of devices
|
| 53 |
+
num_devices = len(schedule_data)
|
| 54 |
+
|
|
|
|
| 55 |
empty_color = "whitesmoke"
|
| 56 |
+
# Colors for task types
|
| 57 |
+
def get_color(op_type: str, stage_id: int):
|
| 58 |
+
# Base colors
|
| 59 |
+
forward_base_color = "royalblue"
|
| 60 |
+
backward_base_color = "lightgreen" # Changed from sandybrown to match your visualization
|
| 61 |
+
|
| 62 |
+
virtual_stage = stage_id // num_devices
|
| 63 |
|
| 64 |
+
if op_type == "forward":
|
| 65 |
+
if virtual_stage == 0:
|
| 66 |
+
return forward_base_color
|
| 67 |
+
else:
|
| 68 |
+
# Lighter shade for virtual_stage > 0
|
| 69 |
+
return "lightskyblue"
|
| 70 |
+
elif op_type == "backward":
|
| 71 |
+
if virtual_stage == 0:
|
| 72 |
+
return backward_base_color
|
| 73 |
+
else:
|
| 74 |
+
# Lighter shade for virtual_stage > 0
|
| 75 |
+
return "lightseagreen"
|
| 76 |
+
else:
|
| 77 |
+
raise ValueError(f"Invalid operation type: {op_type}")
|
| 78 |
|
| 79 |
# Find the maximum time in the schedule if not provided
|
| 80 |
if max_time is None:
|
| 81 |
max_time = 0
|
| 82 |
+
for device in schedule_data:
|
| 83 |
+
for task in schedule_data[device]:
|
| 84 |
end_time = task["start_time"] + task["duration"]
|
| 85 |
if end_time > max_time:
|
| 86 |
max_time = end_time
|
|
|
|
| 89 |
fig = go.Figure()
|
| 90 |
|
| 91 |
# Initialize progress tracking
|
| 92 |
+
total_tasks = sum(len(tasks) for tasks in schedule_data.values())
|
| 93 |
tasks_processed = 0
|
| 94 |
|
| 95 |
if show_progress:
|
| 96 |
+
progress_bar = tqdm(total=total_tasks + num_devices + 3, desc="Creating visualization")
|
| 97 |
|
| 98 |
+
# Create a custom y-axis with no gaps between devices
|
| 99 |
+
y_spacing = 1.0 # Use 1.0 for no gaps
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 100 |
|
| 101 |
# Add rectangles for each task
|
| 102 |
+
for device_idx, device in enumerate(schedule_data):
|
| 103 |
+
device_idx_reversed = num_devices - device_idx - 1
|
| 104 |
+
|
| 105 |
+
# Sort tasks by start time to ensure correct rendering
|
| 106 |
+
sorted_tasks = sorted(schedule_data[device], key=lambda t: t["start_time"])
|
| 107 |
|
| 108 |
+
for task in sorted_tasks:
|
| 109 |
# Determine task color and text color
|
| 110 |
if task["type"] == "forward":
|
| 111 |
+
color = get_color(task["type"], task["stage"])
|
| 112 |
text_color = "white"
|
| 113 |
name = "Forward"
|
| 114 |
elif task["type"] == "backward":
|
| 115 |
+
color = get_color(task["type"], task["stage"])
|
| 116 |
text_color = "black"
|
| 117 |
name = "Backward"
|
| 118 |
+
else:
|
| 119 |
+
color = empty_color
|
| 120 |
text_color = "black"
|
| 121 |
+
name = "Unknown"
|
| 122 |
+
|
| 123 |
# Add rectangle for the task
|
| 124 |
start_time = task["start_time"]
|
| 125 |
duration = task["duration"]
|
| 126 |
|
| 127 |
+
# Calculate y positions with no gaps
|
| 128 |
+
y_pos = device_idx_reversed * y_spacing
|
| 129 |
+
|
| 130 |
# Create rectangle using shape
|
| 131 |
fig.add_shape(
|
| 132 |
type="rect",
|
| 133 |
x0=start_time,
|
| 134 |
+
y0=y_pos - 0.5,
|
| 135 |
x1=start_time + duration,
|
| 136 |
+
y1=y_pos + 0.5,
|
| 137 |
line=dict(color="black", width=0.5),
|
| 138 |
fillcolor=color,
|
| 139 |
layer="above",
|
|
|
|
| 142 |
# Add batch number text
|
| 143 |
fig.add_annotation(
|
| 144 |
x=start_time + duration / 2,
|
| 145 |
+
y=y_pos,
|
| 146 |
+
text=f"{task['batch']}", # Only show batch ID
|
| 147 |
showarrow=False,
|
| 148 |
+
font=dict(color=text_color, size=12, family="Arial, bold"), # Increased font size
|
| 149 |
)
|
| 150 |
|
| 151 |
+
# Add hover data with additional details
|
| 152 |
+
fig.add_trace(go.Scatter(
|
| 153 |
+
x=[start_time + duration / 2],
|
| 154 |
+
y=[y_pos],
|
| 155 |
+
mode='markers',
|
| 156 |
+
marker=dict(opacity=0), # Invisible marker
|
| 157 |
+
hoverinfo='text',
|
| 158 |
+
text=f"Batch: {task['batch']}<br>Stage: {task['stage']}<br>Type: {name}<br>Start: {task['start_time']:.2f}<br>End: {task['start_time'] + task['duration']:.2f}<br>Duration: {task['duration']:.2f}",
|
| 159 |
+
showlegend=False
|
| 160 |
+
))
|
| 161 |
+
|
| 162 |
# Update progress
|
| 163 |
if show_progress:
|
| 164 |
tasks_processed += 1
|
|
|
|
| 166 |
|
| 167 |
# Add custom legend
|
| 168 |
legend_items = [
|
| 169 |
+
dict(name="Forward", color=get_color("forward", 0)),
|
| 170 |
+
dict(name="Backward", color=get_color("backward", 0)),
|
|
|
|
| 171 |
]
|
| 172 |
|
| 173 |
for i, item in enumerate(legend_items):
|
|
|
|
| 183 |
progress_bar.update(1)
|
| 184 |
|
| 185 |
# Set axis properties
|
| 186 |
+
device_labels = [f"Device {i}" for i in range(num_devices)]
|
| 187 |
+
device_labels.reverse() # Reverse to put Device 0 at the top
|
| 188 |
+
|
| 189 |
+
# Calculate tick positions with no gaps
|
| 190 |
+
tick_positions = [(num_devices - i - 1) * y_spacing for i in range(num_devices)]
|
| 191 |
|
| 192 |
+
# Adjust the range to ensure there are no empty spaces at the end
|
| 193 |
+
x_end = max_time * 1.05 # Add a small margin
|
| 194 |
+
|
| 195 |
fig.update_layout(
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 196 |
yaxis=dict(
|
| 197 |
tickmode="array",
|
| 198 |
+
tickvals=tick_positions,
|
| 199 |
ticktext=device_labels,
|
| 200 |
showgrid=False,
|
| 201 |
zeroline=False,
|
|
|
|
| 202 |
),
|
| 203 |
+
margin=dict(l=50, r=20, t=40, b=40),
|
| 204 |
plot_bgcolor="white",
|
| 205 |
+
title=dict(
|
| 206 |
+
text="Pipeline Parallelism Schedule",
|
| 207 |
+
x=0.5,
|
| 208 |
+
y=0.98, # Move title position closer to the top
|
| 209 |
+
font=dict(size=20)
|
| 210 |
+
),
|
| 211 |
legend=dict(
|
| 212 |
orientation="h",
|
| 213 |
+
yanchor="top",
|
| 214 |
+
y=-0.1, # Position below the plot
|
| 215 |
xanchor="center",
|
| 216 |
x=0.5
|
| 217 |
+
),
|
| 218 |
+
width=1600,
|
| 219 |
+
height=400, # Reduce height to make the visualization more compact
|
| 220 |
+
bargap=0,
|
| 221 |
+
bargroupgap=0,
|
| 222 |
)
|
| 223 |
|
| 224 |
if show_progress:
|
| 225 |
+
progress_bar.update(1)
|
| 226 |
progress_bar.close()
|
| 227 |
|
| 228 |
return fig
|
| 229 |
|
| 230 |
|
| 231 |
+
def create_dash_app(schedule: Schedule, schedule_type="1f1b"):
|
| 232 |
"""
|
| 233 |
+
Create a Dash app to visualize the pipeline schedule.
|
| 234 |
+
|
| 235 |
Args:
|
| 236 |
+
schedule: Schedule object to visualize
|
| 237 |
+
schedule_type: Type of schedule ("1f1b" or other)
|
| 238 |
"""
|
| 239 |
+
# Convert schedule to visualization format
|
| 240 |
+
schedule_data = convert_schedule_to_visualization_format(schedule)
|
| 241 |
+
|
| 242 |
+
# Create the app
|
| 243 |
+
app = dash.Dash(__name__, title=f"Pipeline Parallelism Visualizer - {schedule_type}")
|
| 244 |
|
| 245 |
app.layout = html.Div([
|
| 246 |
+
html.H1(f"Pipeline Parallelism Visualizer - {schedule_type}", style={'textAlign': 'center'}),
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 247 |
|
| 248 |
html.Div([
|
| 249 |
+
html.Div([
|
| 250 |
+
html.H3("Schedule Configuration:"),
|
| 251 |
+
html.Ul([
|
| 252 |
+
html.Li(f"Number of devices: {schedule.config.num_devices}"),
|
| 253 |
+
html.Li(f"Number of stages: {schedule.config.num_stages}"),
|
| 254 |
+
html.Li(f"Number of batches: {schedule.config.num_batches}"),
|
| 255 |
+
]),
|
| 256 |
+
], className="config-section"),
|
| 257 |
+
|
| 258 |
+
html.Button("Download Image", id="btn-download",
|
| 259 |
+
style={
|
| 260 |
+
'marginTop': '20px',
|
| 261 |
+
'padding': '10px',
|
| 262 |
+
'backgroundColor': '#007BFF',
|
| 263 |
+
'color': 'white',
|
| 264 |
+
'border': 'none',
|
| 265 |
+
'borderRadius': '5px',
|
| 266 |
+
'cursor': 'pointer'
|
| 267 |
+
}),
|
| 268 |
+
|
| 269 |
+
dcc.Download(id="download-image"),
|
| 270 |
+
], style={'margin': '20px'}),
|
| 271 |
+
|
| 272 |
+
html.Div(id="graph-container", children=[]),
|
| 273 |
+
|
| 274 |
+
dcc.Graph(
|
| 275 |
+
id="pipeline-graph",
|
| 276 |
+
config={'displayModeBar': True, 'toImageButtonOptions': {'format': 'png', 'filename': 'pipeline_visualization'}}
|
| 277 |
+
),
|
| 278 |
])
|
| 279 |
|
| 280 |
@app.callback(
|
|
|
|
| 284 |
)
|
| 285 |
def load_graph(_):
|
| 286 |
# Create the figure when the app loads
|
| 287 |
+
return create_pipeline_figure(schedule_data, show_progress=True)
|
| 288 |
+
|
| 289 |
@app.callback(
|
| 290 |
Output("download-image", "data"),
|
| 291 |
Input("btn-download", "n_clicks"),
|
| 292 |
prevent_initial_call=True,
|
| 293 |
)
|
| 294 |
def download_image(n_clicks):
|
| 295 |
+
# Generate the figure for download
|
| 296 |
+
fig = create_pipeline_figure(schedule_data, show_progress=True)
|
| 297 |
+
|
| 298 |
+
# Convert to base64 image
|
| 299 |
+
img_bytes = fig.to_image(format="png", width=1600, height=1000, scale=2)
|
| 300 |
+
img_base64 = base64.b64encode(img_bytes).decode('ascii')
|
| 301 |
+
|
| 302 |
+
# Return the download data
|
| 303 |
return dict(
|
| 304 |
+
content=img_base64,
|
| 305 |
+
filename=f"pipeline_visualization_{schedule_type}.png",
|
| 306 |
+
type="image/png",
|
| 307 |
+
base64=True
|
| 308 |
)
|
| 309 |
|
| 310 |
return app
|
| 311 |
|
| 312 |
|
| 313 |
def visualize_pipeline_parallelism_dash(
    schedule: Schedule,
    port: int = 8050,
    debug: bool = False
):
    """Serve an interactive Dash visualization of the pipeline schedule.

    Args:
        schedule: Schedule object whose timed operations will be drawn.
        port: TCP port the local Dash server listens on.
        debug: Forwarded to Dash's development-server debug mode.
    """
    dash_app = create_dash_app(schedule)
    # Announce the URL before blocking in the server loop.
    print(f"Starting Dash app on http://localhost:{port}/")
    dash_app.run_server(debug=debug, port=port)
|
| 329 |
|
| 330 |
|
| 331 |
def save_pipeline_visualization_plotly(
    schedule: Schedule,
    output_file: str = "pipeline_visualization_plotly.png",
):
    """Render the pipeline schedule to a static image file via Plotly.

    Args:
        schedule: Schedule object whose timed operations will be drawn.
        output_file: Destination path for the rendered image.
    """
    viz_data = convert_schedule_to_visualization_format(schedule)
    figure = create_pipeline_figure(viz_data, show_progress=True)

    print(f"Saving visualization to {output_file}...")
    # 1600x400 matches the interactive figure's layout dimensions; scale=2 for resolution.
    figure.write_image(output_file, width=1600, height=400, scale=2)
    print(f"Visualization saved to {output_file}")
|
| 348 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
visualizer.py
DELETED
|
@@ -1,141 +0,0 @@
|
|
| 1 |
-
import matplotlib.pyplot as plt
|
| 2 |
-
import numpy as np
|
| 3 |
-
from matplotlib.patches import Rectangle
|
| 4 |
-
from typing import List, Dict, Literal
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
def visualize_pipeline_parallelism(
    schedule: Dict[int, List[Dict]],
    schedule_type: Literal["simple", "1f1b"] = "1f1b",
    output_file: str = "pipeline_visualization.png",
):
    """
    Visualize pipeline parallelism scheduling as a static matplotlib chart.

    Draws one row per device (Device 1 on top) with colored rectangles for
    each task, the batch number centered in each rectangle, and a
    Forward/Backward/Optimizer legend below the plot. The figure is written
    to ``output_file`` and the matplotlib figure is closed afterwards.

    Args:
        schedule: Dictionary mapping device IDs to lists of tasks.
            Each task is a dictionary with keys:
            - 'type': 'forward', 'backward', or 'optimizer'
            - 'batch': batch number
            - 'start_time': start time of the task
            - 'duration': duration of the task
        schedule_type: Type of scheduling algorithm used ("simple" or "1f1b").
            NOTE(review): currently unused in the body — kept for interface
            compatibility with callers; confirm before removing.
        output_file: Path to save the visualization
    """
    # Colors for task types
    forward_color = "royalblue"
    backward_color = "sandybrown"  # Changed to match the reference image
    optimizer_color = "#FFEFCF"  # Light beige for optimizer steps
    empty_color = "whitesmoke"  # Very light gray for empty cells

    # Find the number of stages (devices)
    num_stages = len(schedule)

    # Find the maximum time in the schedule
    max_time = 0
    for device in schedule:
        for task in schedule[device]:
            end_time = task["start_time"] + task["duration"]
            if end_time > max_time:
                max_time = end_time

    # Create figure and axis
    fig, ax = plt.subplots(figsize=(15, 4))

    # Create an empty grid with light gray color.
    # Drawn first so task rectangles added later paint over it (patch z-order
    # follows insertion order here).
    for device_idx in range(num_stages):
        device_idx_reversed = num_stages - device_idx - 1  # Reverse the device index for plotting
        for t in range(int(max_time) + 1):
            rect = Rectangle(
                (t, device_idx_reversed),
                1.0,
                1.0,
                edgecolor="lightgray",
                facecolor=empty_color,
                linewidth=0.5,
            )
            ax.add_patch(rect)

    # Plot the schedule: one filled rectangle plus a batch-number label per task.
    for device_idx, device in enumerate(schedule):
        device_idx_reversed = num_stages - device_idx - 1  # Reverse the device index for plotting
        for task in schedule[device]:
            # Determine task color; anything that is not forward/backward
            # (e.g. 'optimizer') falls through to the optimizer color.
            if task["type"] == "forward":
                color = forward_color
                text_color = "white"
            elif task["type"] == "backward":
                color = backward_color
                text_color = "black"
            else:  # optimizer or any other type
                color = optimizer_color
                text_color = "black"

            rect = Rectangle(
                (task["start_time"], device_idx_reversed),
                task["duration"],
                1.0,
                edgecolor="black",
                facecolor=color,
                linewidth=0.5,
            )
            ax.add_patch(rect)

            # Add text (batch number) centered inside the task rectangle
            ax.text(
                task["start_time"] + task["duration"] / 2,
                device_idx_reversed + 0.5,
                str(task["batch"]),
                ha="center",
                va="center",
                fontsize=10,
                fontweight="bold",
                color=text_color,
            )

    # Set axis limits and labels
    ax.set_xlim(0, max_time + 0.5)
    ax.set_ylim(-0.5, num_stages + 0.5)
    ax.set_yticks(np.arange(num_stages) + 0.5)

    # Reverse the order: Device 1 at the top, highest number at the bottom
    device_labels = [f"Device {i+1}" for i in range(num_stages)]
    device_labels.reverse()  # Reverse to put Device 1 at the top
    ax.set_yticklabels(device_labels)

    # Add "Time" label and arrow at the bottom (drawn just below the axis)
    arrow_y = -0.4
    ax.text(0.5, arrow_y, "Time", ha="right", va="center", fontsize=10)
    ax.annotate("", xy=(2, arrow_y), xytext=(1, arrow_y),
                arrowprops=dict(arrowstyle="->", lw=1))

    # Remove the x-axis ticks
    ax.set_xticks([])

    # Remove the outer frame/border
    for spine in ax.spines.values():
        spine.set_visible(False)

    # Add a legend - using 3 parts like in the reference image.
    # Proxy Rectangle patches stand in for the plotted task rectangles.
    forward_patch = Rectangle((0, 0), 1, 1, facecolor=forward_color)
    backward_patch = Rectangle((0, 0), 1, 1, facecolor=backward_color)
    optimizer_patch = Rectangle((0, 0), 1, 1, facecolor=optimizer_color)

    # NOTE(review): the returned legend handle is unused; assignment kept as-is.
    legend = ax.legend(
        [forward_patch, backward_patch, optimizer_patch],
        ["Forward", "Backward", "Optimizer step"],
        loc="upper center",
        bbox_to_anchor=(0.5, -0.15),
        ncol=3,
        frameon=False,
    )

    # Turn off grid
    ax.grid(False)

    # Save the figure and release its memory
    plt.tight_layout()
    plt.savefig(output_file, dpi=300, bbox_inches="tight")
    plt.close()

    print(f"Visualization saved to {output_file}")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|