# gprMax RAG Database System
## Overview
This is a production-ready Retrieval-Augmented Generation (RAG) system for the gprMax documentation. It provides efficient vector search over the documentation, enabling the chatbot to retrieve relevant context for user queries.
## Architecture
### Components
1. **Document Processor**: Extracts and chunks documentation from gprMax GitHub repository
2. **Embedding Model**: Qwen2.5-0.5B (will upgrade to Qwen3-Embedding-0.6B when available)
3. **Vector Database**: ChromaDB with persistent storage
4. **Retriever**: Search and context retrieval utilities
### Key Features
- Automatic documentation extraction from gprMax GitHub repository
- Intelligent chunking with configurable size and overlap
- Persistent vector database using ChromaDB
- Efficient similarity search with score thresholding
- Metadata tracking for reproducibility
## Installation
The database is **automatically generated** on first startup of the application. No manual installation required!
## Automatic Generation
When the app starts:
1. Checks if database exists at `rag-db/chroma_db/`
2. If not found, automatically runs `generate_db.py`
3. Clones gprMax repository and processes documentation
4. Creates the ChromaDB collection using ChromaDB's built-in default embedding model (all-MiniLM-L6-v2)
5. Ready to use - this only happens once!
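The startup check above can be sketched as follows. This is a minimal sketch, not the actual app code: `ensure_db` and the injectable `generate` callable are hypothetical names used for illustration.

```python
import subprocess
from pathlib import Path

def ensure_db(db_path="rag-db/chroma_db",
              generate=lambda: subprocess.run(
                  ["python", "rag-db/generate_db.py"], check=True)):
    """Run database generation only if no database exists yet."""
    if Path(db_path).exists():
        return False  # database already present, nothing to do
    generate()        # one-time generation on first startup
    return True
```

Injecting the `generate` callable keeps the check testable without actually cloning the repository.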
## Manual Generation (Optional)
If you need to manually regenerate the database:
```bash
cd rag-db
python generate_db.py --recreate
```
Custom settings:
```bash
python generate_db.py \
  --db-path ./custom_db \
  --temp-dir ./temp \
  --device cuda \
  --recreate
```
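The flags above suggest an `argparse` interface along these lines. This is a sketch of how `generate_db.py` might parse them, not the actual source; defaults are assumptions.

```python
import argparse

def build_parser():
    """Build a CLI parser matching the documented generate_db.py flags."""
    parser = argparse.ArgumentParser(description="Generate the gprMax RAG database")
    parser.add_argument("--db-path", default="./chroma_db",
                        help="where to store the ChromaDB files")
    parser.add_argument("--temp-dir", default="./temp",
                        help="scratch directory for the cloned repository")
    parser.add_argument("--device", default="cpu", choices=["cpu", "cuda"],
                        help="device for the embedding model")
    parser.add_argument("--recreate", action="store_true",
                        help="drop and rebuild an existing database")
    return parser
```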
## Using the Retriever in the Application
```python
from rag_db.retriever import create_retriever

# Initialize the retriever against the generated database
retriever = create_retriever(db_path="./rag-db/chroma_db")

# Search for relevant documents
results = retriever.search("How to create a source?", k=5)

# Get formatted context for the LLM
context = retriever.get_context("antenna patterns", k=3)

# Get relevant source files
files = retriever.get_relevant_files("boundary conditions")

# Get database statistics
stats = retriever.get_stats()
```
### Testing the Retriever
```bash
# Test with default query
python retriever.py

# Test with custom query
python retriever.py "How to model soil layers?"
```
## Database Schema
### Document Structure
```json
{
  "id": "unique_hash",
  "text": "document_chunk_text",
  "metadata": {
    "source": "docs/relative/path.rst",
    "file_type": ".rst",
    "chunk_index": 0,
    "char_start": 0,
    "char_end": 1000
  }
}
```
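The `id` field should be stable across regenerations so that unchanged chunks keep their identity. One way to derive it is to hash the source path, chunk index, and chunk text; this is a sketch, and the actual hashing scheme in `generate_db.py` may differ.

```python
import hashlib

def chunk_id(source, chunk_index, text):
    """Deterministic id for a chunk: identical inputs always hash to the same id."""
    key = f"{source}:{chunk_index}:{text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]
```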
### Metadata File
Generated `metadata.json` contains:
```json
{
  "created_at": "2024-01-01T00:00:00",
  "embedding_model": "Qwen/Qwen2.5-0.5B",
  "collection_name": "gprmax_docs_v1",
  "chunk_size": 1000,
  "chunk_overlap": 200,
  "total_documents": 1234
}
```
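Writing this file is a straightforward `json.dump`. The sketch below assumes the field values shown above; `write_metadata` is a hypothetical helper, not the actual generator code.

```python
import json
from datetime import datetime, timezone

def write_metadata(path, total_documents, chunk_size=1000, chunk_overlap=200):
    """Write the reproducibility metadata file next to the database."""
    metadata = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "embedding_model": "Qwen/Qwen2.5-0.5B",
        "collection_name": "gprmax_docs_v1",
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
        "total_documents": total_documents,
    }
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata
```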
## Configuration
### Chunking Parameters
- `CHUNK_SIZE`: 1000 characters (balances retrieval granularity against the LLM's context-window budget)
- `CHUNK_OVERLAP`: 200 characters (preserves continuity across chunk boundaries)
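A sliding-window chunker with these parameters can be sketched as follows. `chunk_text` is a hypothetical helper; the actual splitter in `generate_db.py` may also respect sentence or section boundaries.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into overlapping windows of chunk_size characters."""
    step = chunk_size - chunk_overlap  # each window advances by 800 chars
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "char_start": start,
            "char_end": min(start + chunk_size, len(text)),
            "text": text[start:start + chunk_size],
        })
        if start + chunk_size >= len(text):
            break  # last window already reached the end of the text
    return chunks
```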
### Embedding Model
- Current: `Qwen/Qwen2.5-0.5B` (512-dim embeddings)
- Future: `Qwen/Qwen3-Embedding-0.6B` (when available)
### Database Settings
- Storage: ChromaDB persistent client
- Collection: `gprmax_docs_v1` (versioned for updates)
- Distance Metric: Cosine similarity
## Maintenance
### Regular Updates
Run monthly, or whenever the gprMax documentation is updated:
```bash
# This will pull latest docs and update database
python generate_db.py
```
### Database Backup
```bash
# Backup database
cp -r chroma_db chroma_db_backup_$(date +%Y%m%d)
```
### Performance Tuning
- Adjust `CHUNK_SIZE` and `CHUNK_OVERLAP` in `generate_db.py`
- Modify batch sizes for large datasets
- Use GPU acceleration with `--device cuda`
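Batch-size tuning in practice means embedding documents in fixed-size slices so peak memory stays bounded. A minimal sketch, where `embed_fn` stands in for the actual embedding call:

```python
def embed_in_batches(texts, embed_fn, batch_size=32):
    """Embed texts in slices of batch_size to bound peak memory use."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[i:i + batch_size]))
    return vectors
```

Lowering `batch_size` trades throughput for a smaller memory footprint, which is the usual fix for out-of-memory errors during generation.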
## Integration with Main App
The RAG system integrates with the main Gradio app:
1. Import retriever in `app.py`
2. Use retriever to augment prompts with context
3. Display source references in UI
Example integration:
```python
# In app.py
from rag_db.retriever import create_retriever

retriever = create_retriever()

def augment_with_context(user_query):
    context = retriever.get_context(user_query, k=3)
    augmented_prompt = f"""
Context from documentation:
{context}

User question: {user_query}
"""
    return augmented_prompt
```
## Troubleshooting
### Common Issues
1. **Database not found**
- Run `python generate_db.py` first
- Check `--db-path` parameter
2. **Out of memory**
- Use smaller batch sizes
- Use CPU instead of GPU
- Reduce chunk size
3. **Slow generation**
- Use GPU with `--device cuda`
- Use a shallow clone (`git clone --depth 1`) to reduce download time
- Use pre-generated database
### Logs
Check generation logs for detailed information:
```bash
python generate_db.py 2>&1 | tee generation.log
```
## Future Enhancements
1. **Model Upgrade**: Migrate to Qwen3-Embedding-0.6B when available
2. **Incremental Updates**: Add documents without full regeneration
3. **Multi-modal Support**: Include images and diagrams from docs
4. **Query Expansion**: Automatic query reformulation for better retrieval
5. **Caching Layer**: Redis cache for frequent queries
6. **Fine-tuned Embeddings**: Domain-specific embedding model for gprMax
## License
Same as parent project.