# gprMax RAG Database System

## Overview
This is a production-ready Retrieval-Augmented Generation (RAG) system for the gprMax documentation. It provides vector search over the documentation so that the chatbot can retrieve relevant context for each query.

## Architecture

### Components
1. **Document Processor**: Extracts and chunks documentation from gprMax GitHub repository
2. **Embedding Model**: Qwen2.5-0.5B (will upgrade to Qwen3-Embedding-0.6B when available)
3. **Vector Database**: ChromaDB with persistent storage
4. **Retriever**: Search and context retrieval utilities

### Key Features
- Automatic documentation extraction from gprMax GitHub repository
- Intelligent chunking with configurable size and overlap
- Persistent vector database using ChromaDB
- Efficient similarity search with score thresholding
- Metadata tracking for reproducibility

## Installation

The database is **automatically generated** on first startup of the application. No manual installation required!

## Automatic Generation

When the app starts, it:
1. Checks whether the database exists at `rag-db/chroma_db/`
2. If it is not found, runs `generate_db.py` automatically
3. Clones the gprMax repository and processes the documentation
4. Creates the ChromaDB database with the default embeddings (all-MiniLM-L6-v2)

The database is then ready to use. Generation only happens once.
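
The startup check above can be sketched as follows. This is a minimal illustration, not the app's actual code: `ensure_database` and `run_generator_script` are hypothetical names, the "non-empty directory means the database exists" rule is an assumption, and the script is assumed to live next to the database directory (as the `rag-db/` layout suggests).

```python
import subprocess
import sys
from pathlib import Path
from typing import Callable, Optional


def run_generator_script(workdir: Path) -> None:
    """Run generate_db.py from workdir (assumed location of the script)."""
    subprocess.run([sys.executable, "generate_db.py"], cwd=workdir, check=True)


def ensure_database(db_path: Path, generate: Optional[Callable[[], None]] = None) -> bool:
    """Generate the database if it is missing.

    Returns True if generation ran, False if the database already existed.
    """
    # Assumption: a non-empty directory counts as an existing database.
    if db_path.is_dir() and any(db_path.iterdir()):
        return False
    if generate is None:
        # Default: invoke the generator script in the parent directory.
        run_generator_script(db_path.parent)
    else:
        # A custom callable makes the check easy to test or override.
        generate()
    return True
```

Calling `ensure_database(Path("rag-db/chroma_db"))` once at startup would then be a no-op on every run after the first.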

## Manual Generation (Optional)

If you need to manually regenerate the database:

```bash
cd rag-db
python generate_db.py --recreate
```

Custom settings:
```bash
python generate_db.py \
    --db-path ./custom_db \
    --temp-dir ./temp \
    --device cuda \
    --recreate
```

## Usage

### Use Retriever in Application

```python
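# Assumes the package is importable as `rag_db`; a hyphenated
# directory name such as `rag-db` cannot be imported directly.
from rag_db.retriever import create_retriever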

# Initialize retriever
retriever = create_retriever(db_path="./rag-db/chroma_db")

# Search for relevant documents
results = retriever.search("How to create a source?", k=5)

# Get formatted context for LLM
context = retriever.get_context("antenna patterns", k=3)

# Get relevant source files
files = retriever.get_relevant_files("boundary conditions")

# Get database statistics
stats = retriever.get_stats()
```

### Test Retriever

```bash
# Test with default query
python retriever.py

# Test with custom query
python retriever.py "How to model soil layers?"
```

## Database Schema

### Document Structure
```json
{
    "id": "unique_hash",
    "text": "document_chunk_text",
    "metadata": {
        "source": "docs/relative/path.rst",
        "file_type": ".rst",
        "chunk_index": 0,
        "char_start": 0,
        "char_end": 1000
    }
}
```
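
As an illustration of this structure, the sketch below builds one record from a chunk. `make_record` is a hypothetical helper, and the ID scheme (SHA-256 of the source path and chunk index) is an assumption; the README only states that the ID is a unique hash.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Any, Dict


def make_record(source: str, chunk_index: int, text: str, char_start: int) -> Dict[str, Any]:
    """Build a document record matching the structure shown above."""
    # Assumed ID scheme: hash of "<source>:<chunk_index>", stable across runs.
    raw_id = f"{source}:{chunk_index}".encode("utf-8")
    return {
        "id": hashlib.sha256(raw_id).hexdigest(),
        "text": text,
        "metadata": {
            "source": source,
            "file_type": PurePosixPath(source).suffix,  # e.g. ".rst"
            "chunk_index": chunk_index,
            "char_start": char_start,
            "char_end": char_start + len(text),
        },
    }
```

A deterministic ID like this means regenerating the database from unchanged sources yields the same IDs, which helps when comparing two builds.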

### Metadata File
Generated `metadata.json` contains:
```json
{
    "created_at": "2024-01-01T00:00:00",
    "embedding_model": "Qwen/Qwen2.5-0.5B",
    "collection_name": "gprmax_docs_v1",
    "chunk_size": 1000,
    "chunk_overlap": 200,
    "total_documents": 1234
}
```
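
One possible use of this file is to confirm that an existing database was built with the settings the app expects before reusing it. The sketch below assumes the JSON layout shown above; `settings_match` is a hypothetical helper, not part of the project.

```python
import json
from pathlib import Path
from typing import Any, Dict


def settings_match(metadata_path: Path, expected: Dict[str, Any]) -> bool:
    """Return True if every expected key has the same value in metadata.json."""
    stored = json.loads(metadata_path.read_text(encoding="utf-8"))
    # A missing key reads as None, so it fails the comparison for non-None values.
    return all(stored.get(key) == value for key, value in expected.items())
```

For example, `settings_match(Path("chroma_db/metadata.json"), {"chunk_size": 1000, "chunk_overlap": 200})` could gate a decision to regenerate.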

## Configuration

### Chunking Parameters
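- `CHUNK_SIZE`: 1000 characters per chunk (small enough that several chunks fit in one prompt)
- `CHUNK_OVERLAP`: 200 characters shared between consecutive chunks (text near a chunk boundary keeps some surrounding context)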
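
To make the interaction of the two parameters concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is an illustration under simple assumptions (plain character offsets, no sentence or section awareness); `chunk_text` is a hypothetical name and may differ from what `generate_db.py` does.

```python
from typing import Any, Dict, List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[Dict[str, Any]]:
    """Split text into chunks of chunk_size characters, each overlapping the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks: List[Dict[str, Any]] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({
            "chunk_index": len(chunks),
            "char_start": start,
            "char_end": end,
            "text": text[start:end],
        })
        if end == len(text):
            break  # the last chunk reached the end of the text
        start += step
    return chunks
```

With the defaults, a 2500-character document yields three chunks starting at offsets 0, 800, and 1600.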

### Embedding Model
- Current: `Qwen/Qwen2.5-0.5B` (512-dim embeddings)
- Future: `Qwen/Qwen3-Embedding-0.6B` (when available)

### Database Settings
- Storage: ChromaDB persistent client
- Collection: `gprmax_docs_v1` (versioned for updates)
- Distance Metric: Cosine distance (`1 - cosine similarity`)
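
The sketch below shows how a cosine distance and a score threshold fit together, in plain Python. It assumes the distance is defined as one minus the cosine similarity and that lower values mean closer matches; `cosine_distance` and `filter_by_distance` are illustrative helpers, not the retriever's actual API.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity: 0 for same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / norm


def filter_by_distance(hits: List[Tuple[str, float]], max_distance: float) -> List[Tuple[str, float]]:
    """Keep (doc_id, distance) hits within max_distance, closest first."""
    kept = [hit for hit in hits if hit[1] <= max_distance]
    return sorted(kept, key=lambda hit: hit[1])
```

A threshold applied this way drops weak matches instead of always returning the top `k` results regardless of relevance.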

## Maintenance

### Regular Updates
Run monthly, or whenever the gprMax documentation is updated:
```bash
# Pulls the latest docs and updates the database
python generate_db.py
```

### Database Backup
```bash
# Backup database
cp -r chroma_db chroma_db_backup_$(date +%Y%m%d)
```

### Performance Tuning
- Adjust `CHUNK_SIZE` and `CHUNK_OVERLAP` in `generate_db.py`
- Modify batch sizes for large datasets
- Use GPU acceleration with `--device cuda`

## Integration with Main App

The RAG system integrates with the main Gradio app:

1. Import the retriever in `app.py`
2. Use the retriever to augment prompts with context
3. Display source references in the UI

Example integration:
```python
# In app.py
from rag_db.retriever import create_retriever

retriever = create_retriever()

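def augment_with_context(user_query):
    """Prepend retrieved documentation context to the user's question."""
    context = retriever.get_context(user_query, k=3)
    # Build the prompt line by line so no source indentation leaks into it
    return (
        "Context from documentation:\n"
        f"{context}\n\n"
        f"User question: {user_query}\n"
    )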
```
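
For the third integration step, source references could be rendered along these lines. This sketch assumes each search result is a dict shaped like the record in the Document Structure section (with a `metadata.source` field); the actual return type of `retriever.search` may differ, and `format_sources` is a hypothetical helper.

```python
from typing import Any, Dict, List


def format_sources(results: List[Dict[str, Any]]) -> str:
    """Return a Markdown bullet list of unique source files, in first-seen order."""
    seen: List[str] = []
    for result in results:
        source = result["metadata"]["source"]
        if source not in seen:  # several chunks often come from the same file
            seen.append(source)
    return "\n".join(f"- {source}" for source in seen)
```

The resulting string can be appended to the chatbot's answer so users can see which documentation files the context came from.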

## Troubleshooting

### Common Issues

1. **Database not found**
   - Run `python generate_db.py` first
   - Check `--db-path` parameter

2. **Out of memory**
   - Use smaller batch sizes
   - Use CPU instead of GPU
   - Reduce chunk size

3. **Slow generation**
   - Use GPU with `--device cuda`
   - Reduce repository depth with a shallow clone
   - Use a pre-generated database
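
For the out-of-memory case, processing chunks in smaller batches bounds how much data is embedded at once. The helper below is a generic sketch; `batched` is a hypothetical name, and how `generate_db.py` actually batches its work is not specified here.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

Lowering `batch_size` trades generation speed for a smaller peak memory footprint.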

### Logs
Check generation logs for detailed information:
```bash
python generate_db.py 2>&1 | tee generation.log
```

## Future Enhancements

1. **Model Upgrade**: Migrate to Qwen3-Embedding-0.6B when available
2. **Incremental Updates**: Add documents without full regeneration
3. **Multi-modal Support**: Include images and diagrams from docs
4. **Query Expansion**: Automatic query reformulation for better retrieval
5. **Caching Layer**: Redis cache for frequent queries
6. **Fine-tuned Embeddings**: Domain-specific embedding model for gprMax

## License
Same as the parent project.