Commit
·
bbcd957
1
Parent(s):
1597d99
Upd README
README.md
CHANGED
|
@@ -10,29 +10,31 @@ license: apache-2.0
|
|
| 10 |
short_description: Ed-Assistant summary your learning journey with Agentic RAG
|
| 11 |
---
|
| 12 |
|
| 13 |
-
### StudyBuddy
|
| 14 |
-
[
|
| 15 |
-
|
| 16 |
-
An end-to-end RAG (Retrieval-Augmented Generation) app for studying from your own documents. Upload PDF/DOCX files, the app extracts text and images, captions images, chunks into semantic "cards", embeds and stores them in MongoDB, and serves a chat endpoint that answers strictly from your uploaded materials. Includes a lightweight chat-memory feature to improve context continuity, cost-aware model routing, and robust provider retries.
|
| 17 |
|
| 18 |
## Features
|
| 19 |
|
| 20 |
-
- **Document ingestion**: PDF/DOCX parsing (PyMuPDF, python-docx), image extraction
|
| 21 |
-
- **Semantic chunking**: heuristic
|
| 22 |
-
- **Embeddings**: Sentence-Transformers (all-MiniLM-L6-v2
|
| 23 |
- **Vector search**: MongoDB Atlas Vector Search (optional) or local cosine fallback
|
| 24 |
-
- **RAG chat**: cost-aware routing between
|
| 25 |
-
- **
|
| 26 |
-
- **
|
| 27 |
-
- **
|
|
|
|
| 28 |
- **Simple UI**: static frontend under `static/`
|
| 29 |
|
| 30 |
-
##
|
| 31 |
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
-
|
|
|
|
|
|
|
| 36 |
|
| 37 |
## Project Structure
|
| 38 |
|
|
@@ -55,18 +57,23 @@ Dockerfile # container image
|
|
| 55 |
requirements.txt # Python dependencies
|
| 56 |
```
|
| 57 |
|
| 58 |
-
##
|
| 59 |
|
| 60 |
```bash
|
| 61 |
python -m venv .venv && source .venv/bin/activate
|
| 62 |
pip install -r requirements.txt
|
| 63 |
export MONGO_URI="mongodb://localhost:27017"
|
| 64 |
-
uvicorn app:app --reload
|
| 65 |
```
|
| 66 |
|
| 67 |
-
Open UI
|
| 68 |
|
| 69 |
-
Health: `http://localhost:8000/healthz`
|
| 70 |
|
| 71 |
## Configuration
|
| 72 |
|
|
@@ -83,7 +90,7 @@ Environment variables:
|
|
| 83 |
- **NVIDIA_SMALL**: override default NVIDIA small model
|
| 84 |
- Optional logging controls: use process env like `PYTHONWARNINGS=ignore` and manage verbosity per logger if needed
|
| 85 |
|
| 86 |
-
|
| 87 |
|
| 88 |
## Running (Local)
|
| 89 |
|
|
@@ -112,32 +119,34 @@ docker run --rm -p 8000:8000 \
|
|
| 112 |
|
| 113 |
For production, consider `--restart unless-stopped` and setting `--env ATLAS_VECTOR=1` if using Atlas Vector Search.
|
| 114 |
|
| 115 |
-
##
|
| 116 |
-
|
| 117 |
-
-
|
| 118 |
-
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
- GET
|
| 122 |
-
|
| 123 |
-
-
|
| 124 |
-
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
-
|
| 128 |
-
|
| 129 |
-
-
|
| 130 |
-
|
| 131 |
-
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
|
|
|
| 135 |
|
| 136 |
```bash
|
| 137 |
curl -X POST http://localhost:8000/chat \
|
| 138 |
-H 'Content-Type: application/x-www-form-urlencoded' \
|
| 139 |
-d 'user_id=user1' \
|
| 140 |
-
|
|
|
|
| 141 |
```
|
| 142 |
|
| 143 |
Upload example:
|
|
@@ -146,19 +155,21 @@ Upload example:
|
|
| 146 |
curl -X POST http://localhost:8000/upload \
|
| 147 |
-H 'Content-Type: multipart/form-data' \
|
| 148 |
-F 'user_id=user1' \
|
|
|
|
| 149 |
-F 'files=@/path/to/file1.pdf' \
|
| 150 |
-F 'files=@/path/to/file2.docx'
|
| 151 |
```
|
| 152 |
|
| 153 |
-
|
| 154 |
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
|
|
|
|
| 158 |
|
| 159 |
-
|
| 160 |
|
| 161 |
-
If using Atlas Vector Search, create an index
|
| 162 |
|
| 163 |
```json
|
| 164 |
{
|
|
@@ -175,9 +186,9 @@ If using Atlas Vector Search, create an index (UI or API) similar to:
|
|
| 175 |
}
|
| 176 |
```
|
| 177 |
|
| 178 |
-
Set `ATLAS_VECTOR=1` and `MONGO_VECTOR_INDEX
|
| 179 |
|
| 180 |
-
Schema overview:
|
| 181 |
|
| 182 |
- Collection `chunks` (per card):
|
| 183 |
- `user_id` (str), `filename` (str), `topic_name` (str), `summary` (str), `content` (str)
|
|
@@ -192,7 +203,7 @@ Schema overview:
|
|
| 192 |
- NVIDIA and Gemini calls use a simple key rotator. Provide one or more keys via `NVIDIA_API_1..5`, `GEMINI_API_1..5`.
|
| 193 |
- The app is defensive: if embeddings or summarization models are unavailable, it falls back to naive strategies to keep the app responsive (with reduced quality).
|
| 194 |
|
| 195 |
-
|
| 196 |
|
| 197 |
- Logs are tagged by module via `utils/logger.py`:
|
| 198 |
- [APP] app lifecycle, ingestion, chat flow
|
|
@@ -203,7 +214,7 @@ Schema overview:
|
|
| 203 |
- [CHUNKER]/[SUM]/[COMMON]/[PARSER] module-specific messages
|
| 204 |
- Change verbosity by setting the root logger level in code if needed
|
| 205 |
|
| 206 |
-
|
| 207 |
|
| 208 |
- Disable image captioning if CPU-bound by short-circuiting in `utils/caption.py` (return "")
|
| 209 |
- Use smaller `k` in `/chat` for fewer chunks
|
|
@@ -211,12 +222,29 @@ Schema overview:
|
|
| 211 |
- If Atlas Vector is unavailable, local cosine search samples up to 2000 docs; tune in `utils/rag.py`
|
| 212 |
- Run with `--workers` and consider a process manager for production
|
| 213 |
|
| 214 |
## Security Notes
|
| 215 |
|
| 216 |
- CORS is currently open (`allow_origins=["*"]`) for simplicity. Restrict in production
|
| 217 |
- Validate and limit upload sizes at the reverse proxy (e.g., nginx) or add checks in `/upload`
|
| 218 |
- Secrets are passed via environment; avoid committing them
|
| 219 |
|
|
|
|
| 220 |
## Troubleshooting
|
| 221 |
|
| 222 |
- Missing Python packages: install via `pip install -r requirements.txt`.
|
|
@@ -249,6 +277,5 @@ docker-run:
|
|
| 249 |
|
| 250 |
## License
|
| 251 |
|
| 252 |
-
|
| 253 |
-
|
| 254 |
|
|
|
|
| 10 |
short_description: Ed-Assistant summary your learning journey with Agentic RAG
|
| 11 |
---
|
| 12 |
|
| 13 |
+
### StudyBuddy (EdSummariser)
|
| 14 |
+
[Live demo](https://binkhoale1812-edsummariser.hf.space)
|
| 15 |
+
StudyBuddy is an end-to-end Retrieval-Augmented Generation (RAG) app for learning from your own documents. Upload PDF/DOCX files; the app extracts text and images, captions the images, chunks the content into semantic “cards,” embeds the cards and stores them in MongoDB, and serves a chat endpoint that answers strictly from your uploaded materials. It also includes a lightweight chat-memory feature, cost-aware model routing across NVIDIA and Gemini, and API-key rotation with retries.
|
|
|
|
| 16 |
|
| 17 |
## Features
|
| 18 |
|
| 19 |
+
- **Document ingestion**: PDF/DOCX parsing (PyMuPDF, python-docx), image extraction, BLIP-based captions
|
| 20 |
+
- **Semantic chunking**: heuristic headings/size-based chunker → study cards with topic, summary, content
|
| 21 |
+
- **Embeddings**: Sentence-Transformers (`all-MiniLM-L6-v2`) with defensive fallbacks
|
| 22 |
- **Vector search**: MongoDB Atlas Vector Search (optional) or local cosine fallback
|
| 23 |
+
- **RAG chat**: cost-aware routing between NVIDIA and Gemini endpoints
|
| 24 |
+
- **Filename-aware questions**: detects filenames in questions (e.g., `JADE.pdf`) and prioritizes them
|
| 25 |
+
- **Classifier + fallbacks**: NVIDIA classifies file relevance; if retrieval is empty, the app retries (mentions-only, then all files) and finally falls back to file-level summaries
|
| 26 |
+
- **Chat memory**: per-user LRU of QA summaries; history relevance + semantic retrieval
|
| 27 |
+
- **Logging**: tagged logs per module, e.g., [APP], [RAG], [EMBED], [ROUTER]
|
| 28 |
- **Simple UI**: static frontend under `static/`
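
The chat-memory feature listed above is described only as a per-user LRU of Q/A summaries. A minimal sketch of that idea, assuming an in-process store, might look like the following. The class name `ChatMemory` and the limits `max_users` and `max_items` are hypothetical and are not taken from the project code.

```python
from collections import OrderedDict, deque
from typing import Deque, List


class ChatMemory:
    """Hypothetical per-user LRU store of short Q/A summaries."""

    def __init__(self, max_users: int = 1000, max_items: int = 20) -> None:
        self.max_users = max_users
        self.max_items = max_items
        # Least recently used user sits at the front of the OrderedDict.
        self._users: "OrderedDict[str, Deque[str]]" = OrderedDict()

    def add(self, user_id: str, qa_summary: str) -> None:
        items = self._users.pop(user_id, None)
        if items is None:
            # deque(maxlen=...) silently drops the oldest summary when full.
            items = deque(maxlen=self.max_items)
        items.append(qa_summary)
        self._users[user_id] = items  # re-insert as most recently used
        while len(self._users) > self.max_users:
            self._users.popitem(last=False)  # evict the least recently used user

    def recent(self, user_id: str, n: int = 3) -> List[str]:
        items = self._users.get(user_id)
        if items is None:
            return []
        self._users.move_to_end(user_id)  # reading also counts as use
        return list(items)[-n:]
```

Bounding both the number of users and the summaries per user keeps memory flat on a small Space; a persistent store would be needed if history must survive restarts.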
|
| 29 |
|
| 30 |
+
## Architecture
|
| 31 |
|
| 32 |
+
High-level flow:
|
| 33 |
+
1) Upload PDF/DOCX → parse pages → extract images → BLIP captions → merge → chunk into cards → embed → store.
|
| 34 |
+
2) Chat request → detect any filenames in the question → preload filenames + summaries.
|
| 35 |
+
3) NVIDIA marks per-file relevance. Any filenames explicitly mentioned are always included.
|
| 36 |
+
4) Vector search restricted to relevant files. If no hits: retry with mentioned files only, then with all files. If still no hits but summaries exist, return those summaries.
|
| 37 |
+
5) Compose answer with strict guardrails to “answer from context only.” Summarize the Q/A and store in per-user LRU memory.
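
Steps 3 and 4 above can be sketched as a small retry cascade. This is an illustration under stated assumptions, not the project's actual code: `retrieve_with_fallback`, the `search` callable, and the returned `stage` labels are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Sequence


def retrieve_with_fallback(
    search: Callable[[Sequence[str]], List[dict]],
    relevant_files: Sequence[str],
    mentioned_files: Sequence[str],
    all_files: Sequence[str],
    summaries: Dict[str, str],
) -> dict:
    """Try progressively wider file scopes, then fall back to file summaries."""
    # Explicitly mentioned filenames are always part of the first attempt.
    first_scope = list(dict.fromkeys([*relevant_files, *mentioned_files]))
    attempts = [
        ("relevant", first_scope),
        ("mentions_only", list(mentioned_files)),
        ("all_files", list(all_files)),
    ]
    for stage, files in attempts:
        if not files:
            continue  # nothing to search in this scope
        hits = search(files)
        if hits:
            return {"stage": stage, "chunks": hits, "summaries": {}}
    if summaries:
        # No chunks anywhere, but stored summaries can still answer.
        return {"stage": "summary_fallback", "chunks": [], "summaries": dict(summaries)}
    return {"stage": "empty", "chunks": [], "summaries": {}}
```

Returning the stage alongside the chunks makes it easy to log which attempt produced the answer.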
|
| 38 |
|
| 39 |
## Project Structure
|
| 40 |
|
|
|
|
| 57 |
requirements.txt # Python dependencies
|
| 58 |
```
|
| 59 |
|
| 60 |
+
## Prerequisites
|
| 61 |
+
- Python 3.10+
|
| 62 |
+
- MongoDB (local or Atlas). Collections are created automatically
|
| 63 |
+
- Optional: NVIDIA and/or Gemini API keys
|
| 64 |
+
|
| 65 |
+
## Setup (local)
|
| 66 |
|
| 67 |
```bash
|
| 68 |
python -m venv .venv && source .venv/bin/activate
|
| 69 |
pip install -r requirements.txt
|
| 70 |
export MONGO_URI="mongodb://localhost:27017"
|
| 71 |
+
uvicorn app:app --reload --host 0.0.0.0 --port 8000
|
| 72 |
```
|
| 73 |
|
| 74 |
+
Open the UI at `http://localhost:8000/static/`
|
| 75 |
|
| 76 |
+
Health check: `http://localhost:8000/healthz`
|
| 77 |
|
| 78 |
## Configuration
|
| 79 |
|
|
|
|
| 90 |
- **NVIDIA_SMALL**: override default NVIDIA small model
|
| 91 |
- Optional logging controls: use process env like `PYTHONWARNINGS=ignore` and manage verbosity per logger if needed
|
| 92 |
|
| 93 |
+
Logs are emitted at INFO level to stdout with module tags. See `utils/logger.py`.
|
| 94 |
|
| 95 |
## Running (Local)
|
| 96 |
|
|
|
|
| 119 |
|
| 120 |
For production, consider `--restart unless-stopped` and setting `--env ATLAS_VECTOR=1` if using Atlas Vector Search.
|
| 121 |
|
| 122 |
+
## Usage
|
| 123 |
+
UI:
|
| 124 |
+
- Open `http://localhost:8000/static/`
|
| 125 |
+
- Upload PDF/DOCX
|
| 126 |
+
- Ask questions. You can reference filenames, e.g., “Give me a summary on `JADE.pdf` …”
|
| 127 |
+
API:
|
| 128 |
+
- `GET /` → serves `static/index.html`
|
| 129 |
+
- `POST /upload` (multipart form-data)
|
| 130 |
+
- fields: `user_id` (str), `project_id` (str), `files` (one or more PDF/DOCX)
|
| 131 |
+
- response: `{ job_id, status: "processing", total_files }`; background ingestion continues
|
| 132 |
+
- `GET /upload/status?job_id=...` → progress
|
| 133 |
+
- `GET /files?user_id=&project_id=` → filenames + summaries
|
| 134 |
+
- `GET /file-summary?user_id=&project_id=&filename=` → `{ filename, summary }`
|
| 135 |
+
- `POST /chat` (form)
|
| 136 |
+
- fields: `user_id`, `project_id`, `question`, `k` (default 6)
|
| 137 |
+
- behavior:
|
| 138 |
+
- If the question directly asks for a summary or description of a single mentioned file, the endpoint returns that file’s stored summary.
|
| 139 |
+
- Otherwise: NVIDIA relevance classification → vector search (restricted) → retries → summary fallback when needed.
|
| 140 |
+
- returns `{ answer, sources, relevant_files }`
|
| 141 |
+
|
| 142 |
+
Example chat cURL:
|
| 143 |
|
| 144 |
```bash
|
| 145 |
curl -X POST http://localhost:8000/chat \
|
| 146 |
-H 'Content-Type: application/x-www-form-urlencoded' \
|
| 147 |
-d 'user_id=user1' \
|
| 148 |
+
-d 'project_id=demo' \
|
| 149 |
+
--data-urlencode 'question=Give me a summary on JADE.pdf and setup steps'
|
| 150 |
```
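
The same request can be built from Python with only the standard library. The sketch below assumes the form fields documented above (`user_id`, `project_id`, `question`, `k`); the helper name `build_chat_request` is hypothetical. It only constructs the request, so it can be inspected without a running server.

```python
from urllib import parse, request


def build_chat_request(
    base_url: str, user_id: str, project_id: str, question: str, k: int = 6
) -> request.Request:
    """Build (but do not send) a form-encoded POST to /chat."""
    body = parse.urlencode(
        {"user_id": user_id, "project_id": project_id, "question": question, "k": k}
    ).encode("utf-8")
    return request.Request(
        f"{base_url.rstrip('/')}/chat",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen` and decode the JSON body, which per the API notes above should contain `answer`, `sources`, and `relevant_files`.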
|
| 151 |
|
| 152 |
Upload example:
|
|
|
|
| 155 |
curl -X POST http://localhost:8000/upload \
|
| 156 |
-H 'Content-Type: multipart/form-data' \
|
| 157 |
-F 'user_id=user1' \
|
| 158 |
+
-F 'project_id=demo' \
|
| 159 |
-F 'files=@/path/to/file1.pdf' \
|
| 160 |
-F 'files=@/path/to/file2.docx'
|
| 161 |
```
|
| 162 |
|
| 163 |
+
## Data Model
|
| 164 |
|
| 165 |
+
- Collection `chunks` (per card):
|
| 166 |
+
- `user_id`, `project_id`, `filename`, `topic_name`, `summary`, `content`, `page_span`, `card_id`, `embedding[384]`
|
| 167 |
+
- Collection `files` (per file):
|
| 168 |
+
- `user_id`, `project_id`, `filename`, `summary`
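
As an illustration of the `chunks` shape listed above, a card could be modeled and validated before insertion roughly as follows. The class `ChunkDoc`, the field types, and the `(first_page, last_page)` reading of `page_span` are assumptions; only the field names and the 384-dimension embedding come from this README.

```python
from dataclasses import asdict, dataclass
from typing import List, Tuple

EMBEDDING_DIM = 384  # matches `embedding[384]` in the data model above


@dataclass
class ChunkDoc:
    """Hypothetical in-memory shape of one document in the `chunks` collection."""

    user_id: str
    project_id: str
    filename: str
    topic_name: str
    summary: str
    content: str
    page_span: Tuple[int, int]  # assumed to mean (first_page, last_page)
    card_id: str
    embedding: List[float]

    def to_mongo(self) -> dict:
        # Reject vectors of the wrong size before they reach the vector index.
        if len(self.embedding) != EMBEDDING_DIM:
            raise ValueError(
                f"expected {EMBEDDING_DIM} dimensions, got {len(self.embedding)}"
            )
        doc = asdict(self)
        doc["page_span"] = list(self.page_span)  # store as a plain array
        return doc
```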
|
| 169 |
|
| 170 |
+
### Atlas Vector Index (optional)
|
| 171 |
|
| 172 |
+
If using Atlas Vector Search, create an index similar to:
|
| 173 |
|
| 174 |
```json
|
| 175 |
{
|
|
|
|
| 186 |
}
|
| 187 |
```
|
| 188 |
|
| 189 |
+
Set `ATLAS_VECTOR=1` and configure `MONGO_VECTOR_INDEX`.
|
| 190 |
|
| 191 |
+
### Schema overview
|
| 192 |
|
| 193 |
- Collection `chunks` (per card):
|
| 194 |
- `user_id` (str), `filename` (str), `topic_name` (str), `summary` (str), `content` (str)
|
|
|
|
| 203 |
- NVIDIA and Gemini calls use a simple key rotator. Provide one or more keys via `NVIDIA_API_1..5`, `GEMINI_API_1..5`.
|
| 204 |
- The app is defensive: if embeddings or summarization models are unavailable, it falls back to naive strategies to keep the app responsive (with reduced quality).
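
A simple key rotator of the kind described above could be sketched as follows. The environment variable pattern `NVIDIA_API_1..5` comes from this README; the names `load_keys` and `KeyRotator` are hypothetical, and the project's real rotator may behave differently (for example, rotating only on specific HTTP errors).

```python
import itertools
import os
from typing import List, Mapping, Optional


def load_keys(prefix: str, env: Optional[Mapping[str, str]] = None, max_keys: int = 5) -> List[str]:
    """Collect non-empty keys from PREFIX_1 .. PREFIX_<max_keys>."""
    source = os.environ if env is None else env
    keys = [source.get(f"{prefix}_{i}", "").strip() for i in range(1, max_keys + 1)]
    return [key for key in keys if key]


class KeyRotator:
    """Round-robin over the configured keys; call rotate() after a failed request."""

    def __init__(self, keys: List[str]) -> None:
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(keys)
        self.current = next(self._cycle)

    def rotate(self) -> str:
        self.current = next(self._cycle)
        return self.current
```

A caller would typically wrap each provider request in a bounded retry loop and call `rotate()` between attempts.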
|
| 205 |
|
| 206 |
+
### Logging and Observability
|
| 207 |
|
| 208 |
- Logs are tagged by module via `utils/logger.py`:
|
| 209 |
- [APP] app lifecycle, ingestion, chat flow
|
|
|
|
| 214 |
- [CHUNKER]/[SUM]/[COMMON]/[PARSER] module-specific messages
|
| 215 |
- Change verbosity by setting the root logger level in code if needed
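
For readers who have not opened `utils/logger.py`, module-tagged logging can be approximated with the standard `logging` module. This is a sketch only: the helper name `get_logger`, the `studybuddy.` logger prefix, and the exact format string are assumptions rather than the project's implementation.

```python
import logging
import sys


def get_logger(tag: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger that prefixes every message with a [TAG] marker."""
    logger = logging.getLogger(f"studybuddy.{tag.lower()}")
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter(f"%(asctime)s %(levelname)s [{tag}] %(message)s")
        )
        logger.addHandler(handler)
        logger.propagate = False  # keep the root logger from printing duplicates
    logger.setLevel(level)
    return logger
```

Verbosity can then be changed per module, for example `get_logger("RAG", logging.DEBUG)`.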
|
| 216 |
|
| 217 |
+
### Performance and Cost Tips
|
| 218 |
|
| 219 |
- Disable image captioning if CPU-bound by short-circuiting in `utils/caption.py` (return "")
|
| 220 |
- Use smaller `k` in `/chat` for fewer chunks
|
|
|
|
| 222 |
- If Atlas Vector is unavailable, local cosine search samples up to 2000 docs; tune in `utils/rag.py`
|
| 223 |
- Run with `--workers` and consider a process manager for production
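
The local cosine fallback mentioned above can be pictured as a brute-force scan over a capped set of documents. The sketch below uses pure Python for portability and takes the first `sample_limit` documents; how `utils/rag.py` actually samples and scores is not shown in this README, so the names `cosine` and `top_k` and the slicing strategy are assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], docs: List[dict], k: int = 6, sample_limit: int = 2000) -> List[dict]:
    """Score at most `sample_limit` docs against the query and keep the best k."""
    scored = [(cosine(query, doc["embedding"]), doc) for doc in docs[:sample_limit]]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]
```

The cost grows linearly with the number of scanned documents, which is why both the cap and a smaller `k` help on CPU-bound deployments.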
|
| 224 |
|
| 225 |
+
### Retriever Functionality
|
| 226 |
+
|
| 227 |
+
- Filename detection: a regex captures tokens ending in `.pdf`, `.docx`, or `.doc` in the user question; the prose before each filename is not captured.
|
| 228 |
+
- Relevance: NVIDIA classifies files by relevance to the question; any explicitly mentioned filenames are force-included.
|
| 229 |
+
- Retrieval: vector search is run over relevant files; on empty hits, it retries with mentions-only, then with all files.
|
| 230 |
+
- Fallback: if retrieval yields no chunks but file summaries exist, the app returns a composed summary response.
|
| 231 |
+
- Guardrails: responses are instructed to answer only from provided context and to admit when unknown.
|
| 232 |
+
- “I don’t know…” often means no chunks were retrieved:
|
| 233 |
+
- Verify ingestion finished: `GET /upload/status`
|
| 234 |
+
- Confirm files exist: `GET /files`
|
| 235 |
+
- Try `GET /file-summary` to ensure summaries exist
|
| 236 |
+
- Check logs around `[APP] [CHAT]` for relevance, retries, and fallbacks
|
| 237 |
+
- NVIDIA/Gemini API: ensure keys are set (`NVIDIA_API_1..`, `GEMINI_API_1..`). See `[ROUTER]`/`[ROTATOR]` logs.
|
| 238 |
+
- Atlas Vector: set `ATLAS_VECTOR=1` and ensure the index exists; otherwise local cosine fallback is used.
|
| 239 |
+
- Performance: disable BLIP captions in `utils/caption.py` if CPU-bound; reduce `k` in `/chat`.
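
The filename detection described at the top of this list can be approximated with a single regular expression. The pattern below is an assumption written for illustration; the project's actual regex is not shown in this README, and `detect_filenames` is a hypothetical helper name.

```python
import re
from typing import Dict, List

# Word characters, dashes, and dots followed by a supported extension.
FILENAME_RE = re.compile(r"[\w\-.]+\.(?:pdf|docx|doc)\b", re.IGNORECASE)


def detect_filenames(question: str) -> List[str]:
    """Return filenames mentioned in the question, de-duplicated case-insensitively."""
    seen: Dict[str, str] = {}
    for match in FILENAME_RE.finditer(question):
        # Keep the first spelling encountered for each distinct filename.
        seen.setdefault(match.group(0).lower(), match.group(0))
    return list(seen.values())
```

Filenames containing spaces are not matched by this pattern; supporting them would require quoting conventions in the question.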
|
| 240 |
+
|
| 241 |
## Security Notes
|
| 242 |
|
| 243 |
- CORS is currently open (`allow_origins=["*"]`) for simplicity. Restrict in production
|
| 244 |
- Validate and limit upload sizes at the reverse proxy (e.g., nginx) or add checks in `/upload`
|
| 245 |
- Secrets are passed via environment; avoid committing them
|
| 246 |
|
| 247 |
+
|
| 248 |
## Troubleshooting
|
| 249 |
|
| 250 |
- Missing Python packages: install via `pip install -r requirements.txt`.
|
|
|
|
| 277 |
|
| 278 |
## License
|
| 279 |
|
| 280 |
+
**Apache-2.0**
|
|
|
|
| 281 |
|