# gprMax RAG Database System
## Overview
This is a production-ready Retrieval-Augmented Generation (RAG) system for the gprMax documentation. It provides efficient vector search over the documentation, enabling the chatbot to retrieve relevant context for user queries.
## Architecture
### Components
1. **Document Processor**: Extracts and chunks documentation from gprMax GitHub repository
2. **Embedding Model**: Qwen2.5-0.5B (will upgrade to Qwen3-Embedding-0.6B when available)
3. **Vector Database**: ChromaDB with persistent storage
4. **Retriever**: Search and context retrieval utilities
### Key Features
- Automatic documentation extraction from gprMax GitHub repository
- Intelligent chunking with configurable size and overlap
- Persistent vector database using ChromaDB
- Efficient similarity search with score thresholding
- Metadata tracking for reproducibility
## Installation
The database is **automatically generated** on first startup of the application. No manual installation required!
## Automatic Generation
When the app starts:
1. Checks if database exists at `rag-db/chroma_db/`
2. If not found, automatically runs `generate_db.py`
3. Clones gprMax repository and processes documentation
4. Creates the ChromaDB collection using ChromaDB's built-in default embedding model (all-MiniLM-L6-v2)
5. Ready to use - this only happens once!
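The startup check above can be sketched as follows. This is a minimal sketch, not the actual app code: `ensure_db` and the injectable `generate` callable are hypothetical names used for illustration.

```python
import subprocess
from pathlib import Path

def ensure_db(db_path="rag-db/chroma_db",
              generate=lambda: subprocess.run(
                  ["python", "rag-db/generate_db.py"], check=True)):
    """Run database generation only if no database exists yet."""
    if Path(db_path).exists():
        return False  # database already present, nothing to do
    generate()        # one-time generation on first startup
    return True
```

Injecting the `generate` callable keeps the check testable without actually cloning the repository.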
## Manual Generation (Optional)
If you need to manually regenerate the database:
```bash
cd rag-db
python generate_db.py --recreate
```
Custom settings:
```bash
python generate_db.py \
  --db-path ./custom_db \
  --temp-dir ./temp \
  --device cuda \
  --recreate
```
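The flags above suggest an `argparse` interface along these lines. This is a sketch of how `generate_db.py` might parse them, not the actual source; defaults are assumptions.

```python
import argparse

def build_parser():
    """Build a CLI parser matching the documented generate_db.py flags."""
    parser = argparse.ArgumentParser(description="Generate the gprMax RAG database")
    parser.add_argument("--db-path", default="./chroma_db",
                        help="where to store the ChromaDB files")
    parser.add_argument("--temp-dir", default="./temp",
                        help="scratch directory for the cloned repository")
    parser.add_argument("--device", default="cpu", choices=["cpu", "cuda"],
                        help="device for the embedding model")
    parser.add_argument("--recreate", action="store_true",
                        help="drop and rebuild an existing database")
    return parser
```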
## Using the Retriever in the Application
```python
from rag_db.retriever import create_retriever

# Initialize the retriever against the generated database
retriever = create_retriever(db_path="./rag-db/chroma_db")

# Search for relevant documents
results = retriever.search("How to create a source?", k=5)

# Get formatted context for the LLM
context = retriever.get_context("antenna patterns", k=3)

# Get relevant source files
files = retriever.get_relevant_files("boundary conditions")

# Get database statistics
stats = retriever.get_stats()
```
### Testing the Retriever
```bash
# Test with default query
python retriever.py

# Test with custom query
python retriever.py "How to model soil layers?"
```
## Database Schema
### Document Structure
```json
{
  "id": "unique_hash",
  "text": "document_chunk_text",
  "metadata": {
    "source": "docs/relative/path.rst",
    "file_type": ".rst",
    "chunk_index": 0,
    "char_start": 0,
    "char_end": 1000
  }
}
```
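The `id` field should be stable across regenerations so that unchanged chunks keep their identity. One way to derive it is to hash the source path, chunk index, and chunk text; this is a sketch, and the actual hashing scheme in `generate_db.py` may differ.

```python
import hashlib

def chunk_id(source, chunk_index, text):
    """Deterministic id for a chunk: identical inputs always hash to the same id."""
    key = f"{source}:{chunk_index}:{text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]
```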
### Metadata File
Generated `metadata.json` contains:
```json
{
  "created_at": "2024-01-01T00:00:00",
  "embedding_model": "Qwen/Qwen2.5-0.5B",
  "collection_name": "gprmax_docs_v1",
  "chunk_size": 1000,
  "chunk_overlap": 200,
  "total_documents": 1234
}
```
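Writing this file is a straightforward `json.dump`. The sketch below assumes the field values shown above; `write_metadata` is a hypothetical helper, not the actual generator code.

```python
import json
from datetime import datetime, timezone

def write_metadata(path, total_documents, chunk_size=1000, chunk_overlap=200):
    """Write the reproducibility metadata file next to the database."""
    metadata = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "embedding_model": "Qwen/Qwen2.5-0.5B",
        "collection_name": "gprmax_docs_v1",
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
        "total_documents": total_documents,
    }
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata
```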
## Configuration
### Chunking Parameters
- `CHUNK_SIZE`: 1000 characters (balances retrieval granularity against the LLM's context-window budget)
- `CHUNK_OVERLAP`: 200 characters (preserves continuity across chunk boundaries)
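A sliding-window chunker with these parameters can be sketched as follows. `chunk_text` is a hypothetical helper; the actual splitter in `generate_db.py` may also respect sentence or section boundaries.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into overlapping windows of chunk_size characters."""
    step = chunk_size - chunk_overlap  # each window advances by 800 chars
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "char_start": start,
            "char_end": min(start + chunk_size, len(text)),
            "text": text[start:start + chunk_size],
        })
        if start + chunk_size >= len(text):
            break  # last window already reached the end of the text
    return chunks
```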
### Embedding Model
- Current: `Qwen/Qwen2.5-0.5B` (512-dim embeddings)
- Future: `Qwen/Qwen3-Embedding-0.6B` (when available)
### Database Settings
- Storage: ChromaDB persistent client
- Collection: `gprmax_docs_v1` (versioned for updates)
- Distance Metric: Cosine similarity
## Maintenance
### Regular Updates
Run monthly, or whenever the gprMax documentation is updated:
```bash
# This will pull latest docs and update database
python generate_db.py
```
### Database Backup
```bash
# Backup database
cp -r chroma_db chroma_db_backup_$(date +%Y%m%d)
```
### Performance Tuning
- Adjust `CHUNK_SIZE` and `CHUNK_OVERLAP` in `generate_db.py`
- Modify batch sizes for large datasets
- Use GPU acceleration with `--device cuda`
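Batch-size tuning in practice means embedding documents in fixed-size slices so peak memory stays bounded. A minimal sketch, where `embed_fn` stands in for the actual embedding call:

```python
def embed_in_batches(texts, embed_fn, batch_size=32):
    """Embed texts in slices of batch_size to bound peak memory use."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[i:i + batch_size]))
    return vectors
```

Lowering `batch_size` trades throughput for a smaller memory footprint, which is the usual fix for out-of-memory errors during generation.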
## Integration with Main App
The RAG system integrates with the main Gradio app:
1. Import retriever in `app.py`
2. Use retriever to augment prompts with context
3. Display source references in UI
Example integration:
```python
# In app.py
from rag_db.retriever import create_retriever

retriever = create_retriever()

def augment_with_context(user_query):
    context = retriever.get_context(user_query, k=3)
    augmented_prompt = f"""
Context from documentation:
{context}

User question: {user_query}
"""
    return augmented_prompt
```
## Troubleshooting
### Common Issues
1. **Database not found**
- Run `python generate_db.py` first
- Check `--db-path` parameter
2. **Out of memory**
- Use smaller batch sizes
- Use CPU instead of GPU
- Reduce chunk size
3. **Slow generation**
- Use GPU with `--device cuda`
- Use a shallow clone (`git clone --depth 1`) to reduce download time
- Use pre-generated database
### Logs
Check generation logs for detailed information:
```bash
python generate_db.py 2>&1 | tee generation.log
```
## Future Enhancements
1. **Model Upgrade**: Migrate to Qwen3-Embedding-0.6B when available
2. **Incremental Updates**: Add documents without full regeneration
3. **Multi-modal Support**: Include images and diagrams from docs
4. **Query Expansion**: Automatic query reformulation for better retrieval
5. **Caching Layer**: Redis cache for frequent queries
6. **Fine-tuned Embeddings**: Domain-specific embedding model for gprMax
## License
Same as parent project.