Spaces:
				
			
			
	
			
			
					
		Running
		
	
	
	
			
			
	
	
	
	
		
		
					
		Running
		
	adds more documentation
Browse files- .cursorrules +277 -0
- .gitignore +2 -1
    	
        .cursorrules
    ADDED
    
    | @@ -0,0 +1,277 @@ | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | 
|  | |
| 1 | 
            +
            ---
         | 
| 2 | 
            +
            description: SmolLM3 Fine-tuning Pipeline - Project Rules and Conventions
         | 
| 3 | 
            +
            globs: ["**/*.py", "**/*.sh", "**/*.md", "**/*.json"]
         | 
| 4 | 
            +
            alwaysApply: true
         | 
| 5 | 
            +
            ---
         | 
| 6 | 
            +
             | 
| 7 | 
            +
            # SmolLM3 Fine-tuning Pipeline Project Rules
         | 
| 8 | 
            +
             | 
| 9 | 
            +
            ## Project Overview
         | 
| 10 | 
            +
            This is a comprehensive end-to-end fine-tuning pipeline for SmolLM3 models with Trackio monitoring, Hugging Face integration, and interactive configuration management.
         | 
| 11 | 
            +
             | 
| 12 | 
            +
            ## Core Architecture
         | 
| 13 | 
            +
             | 
| 14 | 
            +
            ### Directory Structure
         | 
| 15 | 
            +
            - `config/` - Training configuration files for different scenarios
         | 
| 16 | 
            +
            - `src/` - Core training and model logic
         | 
| 17 | 
            +
            - `scripts/` - Utility scripts for deployment, dataset management, and model pushing
         | 
| 18 | 
            +
            - `docs/` - Comprehensive documentation and guides
         | 
| 19 | 
            +
            - `templates/` - Templates for HF Spaces and datasets
         | 
| 20 | 
            +
            - `tests/` - Test files and debugging scripts
         | 
| 21 | 
            +
            - `outputs/` - Training outputs and checkpoints
         | 
| 22 | 
            +
             | 
| 23 | 
            +
            ### Key Components
         | 
| 24 | 
            +
             | 
| 25 | 
            +
            #### Training Configurations
         | 
| 26 | 
            +
            - **Basic Training**: SmolLM3-3B + OpenHermes-FR, 3 epochs, batch size 2
         | 
| 27 | 
            +
            - **H100 Lightweight**: SmolLM3-3B + OpenHermes-FR (80K samples), 1 epoch, batch size 16
         | 
| 28 | 
            +
            - **A100 Large Scale**: SmolLM3-3B + OpenHermes-FR, 1.3 passes, batch size 8
         | 
| 29 | 
            +
            - **Multiple Passes**: SmolLM3-3B + OpenHermes-FR, 4 epochs, batch size 6
         | 
| 30 | 
            +
            - **Custom Configuration**: User-defined parameters
         | 
| 31 | 
            +
             | 
| 32 | 
            +
            #### Core Modules
         | 
| 33 | 
            +
            - `src/train.py` - Main training orchestration
         | 
| 34 | 
            +
            - `src/model.py` - Model loading and configuration
         | 
| 35 | 
            +
            - `src/data.py` - Dataset processing and loading
         | 
| 36 | 
            +
            - `src/monitoring.py` - Trackio integration and metrics
         | 
| 37 | 
            +
            - `src/trainer.py` - Training loop and optimization
         | 
| 38 | 
            +
             | 
| 39 | 
            +
            ## Coding Conventions
         | 
| 40 | 
            +
             | 
| 41 | 
            +
            ### Python Style
         | 
| 42 | 
            +
            - Use type hints for all function parameters and return values
         | 
| 43 | 
            +
            - Follow PEP 8 for formatting
         | 
| 44 | 
            +
            - Use descriptive variable names in snake_case
         | 
| 45 | 
            +
            - Add comprehensive docstrings for all functions
         | 
| 46 | 
            +
            - Use f-strings for string formatting
         | 
| 47 | 
            +
             | 
| 48 | 
            +
            ### Configuration Management
         | 
| 49 | 
            +
            - All training configs inherit from `SmolLM3Config` base class
         | 
| 50 | 
            +
            - Use dataclasses for configuration objects
         | 
| 51 | 
            +
            - Validate configuration parameters in __post_init__
         | 
| 52 | 
            +
            - Support both YAML and Python configuration files
         | 
| 53 | 
            +
             | 
| 54 | 
            +
            ### Error Handling
         | 
| 55 | 
            +
            - Use try-except blocks for external API calls (HF, Trackio)
         | 
| 56 | 
            +
            - Log errors with appropriate context
         | 
| 57 | 
            +
            - Provide user-friendly error messages
         | 
| 58 | 
            +
            - Implement graceful degradation for optional features
         | 
| 59 | 
            +
             | 
| 60 | 
            +
            ### Monitoring Integration
         | 
| 61 | 
            +
            - Always include Trackio URL and experiment name in configs
         | 
| 62 | 
            +
            - Log metrics every N steps (configurable)
         | 
| 63 | 
            +
            - Save checkpoints and artifacts to HF Datasets
         | 
| 64 | 
            +
            - Use structured logging with consistent field names
         | 
| 65 | 
            +
             | 
| 66 | 
            +
            ## File Naming Conventions
         | 
| 67 | 
            +
             | 
| 68 | 
            +
            ### Configuration Files
         | 
| 69 | 
            +
            - `train_smollm3_*.py` - Training configurations
         | 
| 70 | 
            +
            - `*_config.py` - General configuration files
         | 
| 71 | 
            +
            - Use descriptive suffixes: `_h100_lightweight`, `_a100_large`, `_multiple_passes`
         | 
| 72 | 
            +
             | 
| 73 | 
            +
            ### Script Files
         | 
| 74 | 
            +
            - `deploy_*.py` - Deployment scripts
         | 
| 75 | 
            +
            - `setup_*.py` - Setup and initialization scripts
         | 
| 76 | 
            +
            - `push_*.py` - Model pushing scripts
         | 
| 77 | 
            +
            - `configure_*.py` - Configuration scripts
         | 
| 78 | 
            +
             | 
| 79 | 
            +
            ### Test Files
         | 
| 80 | 
            +
            - `test_*.py` - Test files
         | 
| 81 | 
            +
            - `debug_*.py` - Debugging scripts
         | 
| 82 | 
            +
            - Include descriptive names indicating what they test
         | 
| 83 | 
            +
             | 
| 84 | 
            +
            ## Training Pipeline Workflow
         | 
| 85 | 
            +
             | 
| 86 | 
            +
            ### Interactive Pipeline (`launch.sh`)
         | 
| 87 | 
            +
            1. **Authentication**: HF username and token validation
         | 
| 88 | 
            +
            2. **Configuration Selection**: Choose from predefined configs or custom
         | 
| 89 | 
            +
            3. **Experiment Setup**: Configure experiment name and repositories
         | 
| 90 | 
            +
            4. **Environment Setup**: Install dependencies and setup virtual environment
         | 
| 91 | 
            +
            5. **Deployment**: Deploy Trackio Space and setup HF Dataset
         | 
| 92 | 
            +
            6. **Training**: Execute training with monitoring
         | 
| 93 | 
            +
            7. **Model Push**: Upload model to HF Hub with documentation
         | 
| 94 | 
            +
            8. **Testing**: Validate uploaded model functionality
         | 
| 95 | 
            +
             | 
| 96 | 
            +
            ### Configuration Selection Logic
         | 
| 97 | 
            +
            - Basic Training: Default for beginners and learning
         | 
| 98 | 
            +
            - H100 Lightweight: Rapid experiments on H100 GPUs
         | 
| 99 | 
            +
            - A100 Large Scale: Serious research and production
         | 
| 100 | 
            +
            - Multiple Passes: Thorough training for production models
         | 
| 101 | 
            +
            - Custom: User-defined parameters for specific needs
         | 
| 102 | 
            +
             | 
| 103 | 
            +
            ## Dataset Management
         | 
| 104 | 
            +
             | 
| 105 | 
            +
            ### Supported Formats
         | 
| 106 | 
            +
            - Hugging Face Datasets format
         | 
| 107 | 
            +
            - JSON files with prompt/completion pairs
         | 
| 108 | 
            +
            - Chat format with messages array
         | 
| 109 | 
            +
            - Custom formats with conversion functions
         | 
| 110 | 
            +
             | 
| 111 | 
            +
            ### Dataset Processing
         | 
| 112 | 
            +
            - Automatic format detection and conversion
         | 
| 113 | 
            +
            - Random sampling for lightweight configurations
         | 
| 114 | 
            +
            - Validation split creation
         | 
| 115 | 
            +
            - Bad entry filtering and handling
         | 
| 116 | 
            +
             | 
| 117 | 
            +
            ### Dataset Sampling (H100 Lightweight)
         | 
| 118 | 
            +
            - 80,000 random samples from OpenHermes-FR
         | 
| 119 | 
            +
            - 1,000 validation samples (if available)
         | 
| 120 | 
            +
            - Fixed random seed (42) for reproducibility
         | 
| 121 | 
            +
            - Automatic sampling during dataset preparation
         | 
| 122 | 
            +
             | 
| 123 | 
            +
            ## Model Management
         | 
| 124 | 
            +
             | 
| 125 | 
            +
            ### Model Loading
         | 
| 126 | 
            +
            - Support for HuggingFaceTB/SmolLM3-3B
         | 
| 127 | 
            +
            - Flash attention and gradient checkpointing
         | 
| 128 | 
            +
            - Mixed precision training (fp16/bf16)
         | 
| 129 | 
            +
            - Device mapping and memory optimization
         | 
| 130 | 
            +
             | 
| 131 | 
            +
            ### Model Pushing
         | 
| 132 | 
            +
            - Comprehensive model cards with training details
         | 
| 133 | 
            +
            - Automatic README generation
         | 
| 134 | 
            +
            - License and usage information
         | 
| 135 | 
            +
            - Training metrics and configuration
         | 
| 136 | 
            +
             | 
| 137 | 
            +
            ## Monitoring and Tracking
         | 
| 138 | 
            +
             | 
| 139 | 
            +
            ### Trackio Integration
         | 
| 140 | 
            +
            - Real-time metrics logging
         | 
| 141 | 
            +
            - Training curves visualization
         | 
| 142 | 
            +
            - Resource usage monitoring
         | 
| 143 | 
            +
            - Artifact storage and versioning
         | 
| 144 | 
            +
             | 
| 145 | 
            +
            ### Metrics to Track
         | 
| 146 | 
            +
            - Training and validation loss
         | 
| 147 | 
            +
            - Learning rate schedule
         | 
| 148 | 
            +
            - Gradient norms
         | 
| 149 | 
            +
            - GPU utilization and memory
         | 
| 150 | 
            +
            - Training speed (steps/second)
         | 
| 151 | 
            +
             | 
| 152 | 
            +
            ## Error Handling and Validation
         | 
| 153 | 
            +
             | 
| 154 | 
            +
            ### Input Validation
         | 
| 155 | 
            +
            - Validate HF tokens before use
         | 
| 156 | 
            +
            - Check CUDA availability
         | 
| 157 | 
            +
            - Verify dataset accessibility
         | 
| 158 | 
            +
            - Validate configuration parameters
         | 
| 159 | 
            +
             | 
| 160 | 
            +
            ### Error Recovery
         | 
| 161 | 
            +
            - Graceful handling of network issues
         | 
| 162 | 
            +
            - Automatic retry for failed operations
         | 
| 163 | 
            +
            - Checkpoint recovery for interrupted training
         | 
| 164 | 
            +
            - Fallback options for optional features
         | 
| 165 | 
            +
             | 
| 166 | 
            +
            ## Documentation Standards
         | 
| 167 | 
            +
             | 
| 168 | 
            +
            ### README Files
         | 
| 169 | 
            +
            - Clear project description
         | 
| 170 | 
            +
            - Installation instructions
         | 
| 171 | 
            +
            - Usage examples
         | 
| 172 | 
            +
            - Configuration options
         | 
| 173 | 
            +
            - Troubleshooting guide
         | 
| 174 | 
            +
             | 
| 175 | 
            +
            ### Code Documentation
         | 
| 176 | 
            +
            - Comprehensive docstrings
         | 
| 177 | 
            +
            - Type hints for all functions
         | 
| 178 | 
            +
            - Example usage in docstrings
         | 
| 179 | 
            +
            - Parameter descriptions
         | 
| 180 | 
            +
            - Return value documentation
         | 
| 181 | 
            +
             | 
| 182 | 
            +
            ## Testing and Validation
         | 
| 183 | 
            +
             | 
| 184 | 
            +
            ### Test Categories
         | 
| 185 | 
            +
            - Unit tests for core functions
         | 
| 186 | 
            +
            - Integration tests for pipeline
         | 
| 187 | 
            +
            - Configuration validation tests
         | 
| 188 | 
            +
            - Model loading and saving tests
         | 
| 189 | 
            +
            - Dataset processing tests
         | 
| 190 | 
            +
             | 
| 191 | 
            +
            ### Debugging Tools
         | 
| 192 | 
            +
            - Standalone test scripts
         | 
| 193 | 
            +
            - Configuration validation
         | 
| 194 | 
            +
            - Model testing utilities
         | 
| 195 | 
            +
            - Dataset inspection tools
         | 
| 196 | 
            +
             | 
| 197 | 
            +
            ## Performance Optimization
         | 
| 198 | 
            +
             | 
| 199 | 
            +
            ### H100 Optimizations
         | 
| 200 | 
            +
            - Larger batch sizes (16 vs 8 for A100)
         | 
| 201 | 
            +
            - Reduced gradient accumulation (4 vs 16)
         | 
| 202 | 
            +
            - Higher learning rates (8e-6 vs 5e-6)
         | 
| 203 | 
            +
            - Optimized data loading (4 workers, pin memory)
         | 
| 204 | 
            +
             | 
| 205 | 
            +
            ### Memory Management
         | 
| 206 | 
            +
            - Gradient checkpointing for large models
         | 
| 207 | 
            +
            - Mixed precision training
         | 
| 208 | 
            +
            - Dynamic batch sizing
         | 
| 209 | 
            +
            - Memory-efficient data loading
         | 
| 210 | 
            +
             | 
| 211 | 
            +
            ## Security and Best Practices
         | 
| 212 | 
            +
             | 
| 213 | 
            +
            ### Token Management
         | 
| 214 | 
            +
            - Never hardcode tokens in code
         | 
| 215 | 
            +
            - Use environment variables
         | 
| 216 | 
            +
            - Validate tokens before use
         | 
| 217 | 
            +
            - Secure token storage
         | 
| 218 | 
            +
             | 
| 219 | 
            +
            ### Data Privacy
         | 
| 220 | 
            +
            - Filter sensitive data from datasets
         | 
| 221 | 
            +
            - Validate dataset contents
         | 
| 222 | 
            +
            - Secure data transmission
         | 
| 223 | 
            +
            - Proper data disposal
         | 
| 224 | 
            +
             | 
| 225 | 
            +
            ## Deployment and CI/CD
         | 
| 226 | 
            +
             | 
| 227 | 
            +
            ### Environment Setup
         | 
| 228 | 
            +
            - Python virtual environments
         | 
| 229 | 
            +
            - CUDA-compatible PyTorch
         | 
| 230 | 
            +
            - Required dependencies installation
         | 
| 231 | 
            +
            - System package management
         | 
| 232 | 
            +
             | 
| 233 | 
            +
            ### Automated Deployment
         | 
| 234 | 
            +
            - Trackio Space deployment
         | 
| 235 | 
            +
            - HF Dataset setup
         | 
| 236 | 
            +
            - Model repository creation
         | 
| 237 | 
            +
            - Configuration file generation
         | 
| 238 | 
            +
             | 
| 239 | 
            +
            ## Troubleshooting Guidelines
         | 
| 240 | 
            +
             | 
| 241 | 
            +
            ### Common Issues
         | 
| 242 | 
            +
            - CUDA out of memory: Reduce batch size
         | 
| 243 | 
            +
            - Network timeouts: Check internet connection
         | 
| 244 | 
            +
            - Token validation: Verify HF token permissions
         | 
| 245 | 
            +
            - Dataset loading: Check dataset accessibility
         | 
| 246 | 
            +
             | 
| 247 | 
            +
            ### Debugging Steps
         | 
| 248 | 
            +
            1. Check system requirements
         | 
| 249 | 
            +
            2. Validate configuration
         | 
| 250 | 
            +
            3. Test individual components
         | 
| 251 | 
            +
            4. Review logs and error messages
         | 
| 252 | 
            +
            5. Verify external service connectivity
         | 
| 253 | 
            +
             | 
| 254 | 
            +
            ## Future Enhancements
         | 
| 255 | 
            +
             | 
| 256 | 
            +
            ### Planned Features
         | 
| 257 | 
            +
            - Multi-GPU training support
         | 
| 258 | 
            +
            - Advanced dataset sampling strategies
         | 
| 259 | 
            +
            - Automated hyperparameter optimization
         | 
| 260 | 
            +
            - Enhanced monitoring and visualization
         | 
| 261 | 
            +
            - Support for additional model architectures
         | 
| 262 | 
            +
             | 
| 263 | 
            +
            ### Extensibility
         | 
| 264 | 
            +
            - Modular configuration system
         | 
| 265 | 
            +
            - Plugin architecture for custom features
         | 
| 266 | 
            +
            - Support for custom datasets and models
         | 
| 267 | 
            +
            - Flexible monitoring integration
         | 
| 268 | 
            +
             | 
| 269 | 
            +
            ---
         | 
| 270 | 
            +
             | 
| 271 | 
            +
            **When working with this codebase:**
         | 
| 272 | 
            +
            - Always consider the end-to-end pipeline workflow
         | 
| 273 | 
            +
            - Follow the established configuration patterns
         | 
| 274 | 
            +
            - Include proper error handling and validation
         | 
| 275 | 
            +
            - Maintain comprehensive documentation
         | 
| 276 | 
            +
            - Test changes thoroughly before deployment
         | 
| 277 | 
            +
            - Consider performance implications for different hardware configurations 
         | 
    	
        .gitignore
    CHANGED
    
    | @@ -1,4 +1,5 @@ | |
| 1 | 
            -
            . | 
|  | |
| 2 | 
             
            *.mdc
         | 
| 3 |  | 
| 4 | 
             
            # Python
         | 
|  | |
| 1 | 
            +
            .cursor/
         | 
| 2 | 
            +
            .cursor/rules/
         | 
| 3 | 
             
            *.mdc
         | 
| 4 |  | 
| 5 | 
             
            # Python
         | 
