LiamKhoaLe committed
Commit bbcd957 · 1 Parent(s): 1597d99

Upd README

Files changed (1)
  1. README.md +81 -54
README.md CHANGED
@@ -10,29 +10,31 @@ license: apache-2.0
 short_description: Ed-Assistant summary your learning journey with Agentic RAG
 ---

-### StudyBuddy
-[Access Demo](https://binkhoale1812-edsummariser.hf.space)
-
-An end-to-end RAG (Retrieval-Augmented Generation) app for studying from your own documents. Upload PDF/DOCX files, the app extracts text and images, captions images, chunks into semantic "cards", embeds and stores them in MongoDB, and serves a chat endpoint that answers strictly from your uploaded materials. Includes a lightweight chat-memory feature to improve context continuity, cost-aware model routing, and robust provider retries.

 ## Features

-- **Document ingestion**: PDF/DOCX parsing (PyMuPDF, python-docx), image extraction and BLIP-based captions
-- **Semantic chunking**: heuristic heading/size-based chunker
-- **Embeddings**: Sentence-Transformers (all-MiniLM-L6-v2 by default) with random fallback when unavailable
 - **Vector search**: MongoDB Atlas Vector Search (optional) or local cosine fallback
-- **RAG chat**: cost-aware routing between Gemini and NVIDIA endpoints
-- **Chat memory**: per-user LRU of recent QA summaries; history and semantic retrieval to augment context
-- **Summarization**: cheap extractive summaries via sumy with naive fallback
-- **Centralized logging**: tagged loggers per module, e.g., [APP], [RAG], [CHUNKER]
 - **Simple UI**: static frontend under `static/`

-## Prerequisites

-- Python 3.10+
-- MongoDB instance (local or Atlas). Collections are created automatically
-- Optional: NVIDIA and/or Gemini API keys for model calls
-- Optional but recommended: a virtual environment

 ## Project Structure
@@ -55,18 +57,23 @@ Dockerfile # container image
 requirements.txt # Python dependencies
 ```

-## Quickstart (Local)

 ```bash
 python -m venv .venv && source .venv/bin/activate
 pip install -r requirements.txt
 export MONGO_URI="mongodb://localhost:27017"
-uvicorn app:app --reload
 ```

-Open UI: `http://localhost:8000/static/`

-Health: `http://localhost:8000/healthz`

 ## Configuration

@@ -83,7 +90,7 @@ Environment variables:
 - **NVIDIA_SMALL**: override default NVIDIA small model
 - Optional logging controls: use process env like `PYTHONWARNINGS=ignore` and manage verbosity per logger if needed

-Logging: Logs are sent to stdout at INFO level, tagged per module, e.g., `[APP]`, `[RAG]`. See `utils/logger.py`.

 ## Running (Local)

@@ -112,32 +119,34 @@ docker run --rm -p 8000:8000 \

 For production, consider `--restart unless-stopped` and setting `--env ATLAS_VECTOR=1` if using Atlas Vector Search.

-## API Overview
-
-- GET `/` → serves `static/index.html`
-- POST `/upload` (multipart form-data)
-  - fields: `user_id` (str), `files` (one or more PDF/DOCX)
-  - response: `{ job_id, status: "processing" }`; ingestion proceeds in background
-- GET `/cards`
-  - params: `user_id` (str), `filename` (optional), `limit` (int), `skip` (int)
-  - returns stored cards without embeddings
-- GET `/file-summary`
-  - params: `user_id`, `filename`
-  - returns `{ filename, summary }`
-- POST `/chat` (form-urlencoded)
-  - fields: `user_id`, `question`, `k` (int, default 6)
-  - logic:
-    - If question matches "what is <file> about?": returns file summary
-    - Else: classify relevant files via NVIDIA, augment with chat memory context, run vector search (restricted to relevant files if any), select model, generate answer, store QA summary in LRU
-  - returns `{ answer, sources }` (and `relevant_files` when no hits)
-
-Example cURL:

 ```bash
 curl -X POST http://localhost:8000/chat \
   -H 'Content-Type: application/x-www-form-urlencoded' \
   -d 'user_id=user1' \
-  --data-urlencode 'question=Summarize reinforcement learning from the uploaded notes.'
 ```

 Upload example:
@@ -146,19 +155,21 @@ Upload example:
 curl -X POST http://localhost:8000/upload \
   -H 'Content-Type: multipart/form-data' \
   -F 'user_id=user1' \
   -F 'files=@/path/to/file1.pdf' \
   -F 'files=@/path/to/file2.docx'
 ```

-List cards:

-```bash
-curl 'http://localhost:8000/cards?user_id=user1&limit=10'
-```

-## MongoDB Atlas Vector Index (optional)

-If using Atlas Vector Search, create an index (UI or API) similar to:

 ```json
 {
@@ -175,9 +186,9 @@ If using Atlas Vector Search, create an index (UI or API) similar to:
 }
 ```

-Set `ATLAS_VECTOR=1` and `MONGO_VECTOR_INDEX` accordingly.

-Schema overview:

 - Collection `chunks` (per card):
   - `user_id` (str), `filename` (str), `topic_name` (str), `summary` (str), `content` (str)
@@ -192,7 +203,7 @@ Schema overview:
 - NVIDIA and Gemini calls use a simple key rotator. Provide one or more keys via `NVIDIA_API_1..5`, `GEMINI_API_1..5`.
 - The app is defensive: if embeddings or summarization models are unavailable, it falls back to naive strategies to keep the app responsive (with reduced quality).

-## Logging and Observability

 - Logs are tagged by module via `utils/logger.py`:
   - [APP] app lifecycle, ingestion, chat flow
@@ -203,7 +214,7 @@ Schema overview:
   - [CHUNKER]/[SUM]/[COMMON]/[PARSER] module-specific messages
 - Change verbosity by setting the root logger level in code if needed

-## Performance and Cost Tips

 - Disable image captioning if CPU-bound by short-circuiting in `utils/caption.py` (return "")
 - Use smaller `k` in `/chat` for fewer chunks
@@ -211,12 +222,29 @@ Schema overview:
 - If Atlas Vector is unavailable, local cosine search samples up to 2000 docs; tune in `utils/rag.py`
 - Run with `--workers` and consider a process manager for production

 ## Security Notes

 - CORS is currently open (`allow_origins=["*"]`) for simplicity. Restrict in production
 - Validate and limit upload sizes at the reverse proxy (e.g., nginx) or add checks in `/upload`
 - Secrets are passed via environment; avoid committing them

 ## Troubleshooting

 - Missing Python packages: install via `pip install -r requirements.txt`.
@@ -249,6 +277,5 @@ docker-run:

 ## License

-MIT (or your preferred license). Replace this section if needed.
-
 short_description: Ed-Assistant summary your learning journey with Agentic RAG
 ---

+### StudyBuddy (EdSummariser)
+[Live demo](https://binkhoale1812-edsummariser.hf.space)
+
+StudyBuddy is an end-to-end Retrieval-Augmented Generation (RAG) app for learning from your own documents. Upload PDF/DOCX files; the app extracts text and images, captions images, chunks content into semantic "cards," embeds and stores them in MongoDB, and serves a chat endpoint that answers strictly from your uploaded materials. It includes a lightweight chat-memory feature, cost-aware model routing, NVIDIA/Gemini integration, and robust key rotation/retries.

 ## Features

+- **Document ingestion**: PDF/DOCX parsing (PyMuPDF, python-docx), image extraction, BLIP-based captions
+- **Semantic chunking**: heuristic heading/size-based chunker → study cards with topic, summary, content
+- **Embeddings**: Sentence-Transformers (`all-MiniLM-L6-v2`) with defensive fallbacks
 - **Vector search**: MongoDB Atlas Vector Search (optional) or local cosine fallback
+- **RAG chat**: cost-aware routing between NVIDIA and Gemini endpoints
+- **Filename-aware questions**: detects filenames in questions (e.g., `JADE.pdf`) and prioritizes them
+- **Classifier + fallbacks**: NVIDIA classifies file relevance; if retrieval is empty, the app retries (mentions-only, then all files) and finally falls back to file-level summaries
+- **Chat memory**: per-user LRU of QA summaries; history relevance + semantic retrieval
+- **Logging**: tagged logs per module, e.g., `[APP]`, `[RAG]`, `[EMBED]`, `[ROUTER]`
 - **Simple UI**: static frontend under `static/`

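The per-user chat memory mentioned above can be sketched with a small LRU; class and method names here are illustrative, not the app's actual code:

```python
from collections import OrderedDict

class ChatMemoryLRU:
    """Per-user LRU of recent QA summaries (sketch; capacity is an assumption)."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.store: dict[str, OrderedDict[str, str]] = {}

    def add(self, user_id: str, question: str, summary: str) -> None:
        mem = self.store.setdefault(user_id, OrderedDict())
        mem[question] = summary
        mem.move_to_end(question)      # most recent entry goes last
        while len(mem) > self.capacity:
            mem.popitem(last=False)    # evict the oldest entry

    def recent(self, user_id: str, n: int = 3) -> list[str]:
        """Return the n most recent QA summaries for this user."""
        mem = self.store.get(user_id, OrderedDict())
        return list(mem.values())[-n:]
```

At chat time, the most recent summaries would be prepended to the retrieval context.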
+## Architecture
+
+High-level flow:
+1) Upload PDF/DOCX → parse pages → extract images → BLIP captions → merge → chunk into cards → embed → store.
+2) Chat request → detect any filenames in the question → preload filenames + summaries.
+3) NVIDIA marks per-file relevance. Any filenames explicitly mentioned are always included.
+4) Vector search restricted to relevant files. If no hits: retry with mentioned files only, then with all files. If still no hits but summaries exist, return those summaries.
+5) Compose answer with strict guardrails to "answer from context only." Summarize the Q/A and store it in per-user LRU memory.

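Steps 3–4 form a retry ladder; a minimal sketch of that control flow (the `search` callable and return shape are illustrative assumptions, not the app's actual functions):

```python
from typing import Callable

# Hypothetical search function: (question, allowed filenames) -> matching chunk texts
Search = Callable[[str, list[str]], list[str]]

def retrieve_with_fallbacks(question: str, relevant: list[str], mentioned: list[str],
                            all_files: list[str], summaries: dict[str, str],
                            search: Search) -> dict:
    # Try progressively wider scopes: classifier-relevant (mentions force-included),
    # then explicitly mentioned files only, then every file.
    for scope in (sorted(set(relevant) | set(mentioned)), mentioned, all_files):
        if not scope:
            continue
        hits = search(question, scope)
        if hits:
            return {"chunks": hits, "fallback": False}
    # No hits anywhere: fall back to stored file-level summaries.
    wanted = mentioned or all_files
    return {"chunks": [], "fallback": True,
            "summaries": {f: summaries[f] for f in wanted if f in summaries}}
```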
 ## Project Structure

 requirements.txt # Python dependencies
 ```

+## Prerequisites
+- Python 3.10+
+- MongoDB (local or Atlas). Collections are created automatically
+- Optional: NVIDIA and/or Gemini API keys
+
+## Setup (local)

 ```bash
 python -m venv .venv && source .venv/bin/activate
 pip install -r requirements.txt
 export MONGO_URI="mongodb://localhost:27017"
+uvicorn app:app --reload --host 0.0.0.0 --port 8000
 ```

+Open the UI at `http://localhost:8000/static/`

+Health check: `http://localhost:8000/healthz`

 ## Configuration

 - **NVIDIA_SMALL**: override default NVIDIA small model
 - Optional logging controls: use process env like `PYTHONWARNINGS=ignore` and manage verbosity per logger if needed

+Logs are emitted at INFO level to stdout with module tags. See `utils/logger.py`.

 ## Running (Local)

 For production, consider `--restart unless-stopped` and setting `--env ATLAS_VECTOR=1` if using Atlas Vector Search.

+## Usage
+UI:
+- Open `http://localhost:8000/static/`
+- Upload PDF/DOCX
+- Ask questions. You can reference filenames, e.g., "Give me a summary on `JADE.pdf` …"
+API:
+- `GET /` → serves `static/index.html`
+- `POST /upload` (multipart form-data)
+  - fields: `user_id` (str), `project_id` (str), `files` (one or more PDF/DOCX)
+  - response: `{ job_id, status: "processing", total_files }`; background ingestion continues
+- `GET /upload/status?job_id=...` → progress
+- `GET /files?user_id=&project_id=` → filenames + summaries
+- `GET /file-summary?user_id=&project_id=&filename=` → `{ filename, summary }`
+- `POST /chat` (form)
+  - fields: `user_id`, `project_id`, `question`, `k` (default 6)
+  - behavior:
+    - If the question directly asks what a single mentioned file is about (or for its summary), returns that file's stored summary.
+    - Otherwise: NVIDIA relevance classification → vector search (restricted) → retries → summary fallback when needed.
+  - returns `{ answer, sources, relevant_files }`
+
+Example chat cURL:
 
 ```bash
 curl -X POST http://localhost:8000/chat \
   -H 'Content-Type: application/x-www-form-urlencoded' \
   -d 'user_id=user1' \
+  -d 'project_id=demo' \
+  --data-urlencode 'question=Give me a summary on JADE.pdf and setup steps'
 ```

 Upload example:
 curl -X POST http://localhost:8000/upload \
   -H 'Content-Type: multipart/form-data' \
   -F 'user_id=user1' \
+  -F 'project_id=demo' \
   -F 'files=@/path/to/file1.pdf' \
   -F 'files=@/path/to/file2.docx'
 ```

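Since ingestion continues in the background, a client can poll `GET /upload/status` until the job settles. A sketch with an injected `fetch` callable (so the loop is testable offline; the `status` field values are assumptions based on the upload response above):

```python
import time
from typing import Callable

def wait_for_ingestion(job_id: str, fetch: Callable[[str], dict],
                       timeout_s: float = 60.0, poll_s: float = 1.0) -> dict:
    """Poll /upload/status until the job leaves 'processing' or the timeout hits."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch(f"/upload/status?job_id={job_id}")
        if status.get("status") != "processing":
            return status          # e.g. finished or failed
        time.sleep(poll_s)
    raise TimeoutError(f"job {job_id} still processing after {timeout_s}s")
```

In practice `fetch` would wrap something like `requests.get(base_url + path).json()` against the running server.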
+## Data Model
+
+- Collection `chunks` (per card):
+  - `user_id`, `project_id`, `filename`, `topic_name`, `summary`, `content`, `page_span`, `card_id`, `embedding[384]`
+- Collection `files` (per file):
+  - `user_id`, `project_id`, `filename`, `summary`
+
+### Atlas Vector Index (optional)
+
+If using Atlas Vector Search, create an index similar to:

 ```json
 {
 }
 ```

+Set `ATLAS_VECTOR=1` and configure `MONGO_VECTOR_INDEX`.

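For orientation, a single `chunks` document might look like the following; all field values here are hypothetical, and the 384-dimension embedding is truncated:

```python
# Hypothetical example of one stored card in the `chunks` collection.
card = {
    "user_id": "user1",
    "project_id": "demo",
    "filename": "JADE.pdf",
    "topic_name": "Agent Communication",
    "summary": "Short extractive summary of this card's topic.",
    "content": "Full chunk text merged from page text and image captions.",
    "page_span": [3, 5],              # first and last page the card covers
    "card_id": "JADE.pdf-0003",
    "embedding": [0.012, -0.087, 0.044],  # truncated; real vectors have 384 dims
}
```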
+### Schema overview

 - Collection `chunks` (per card):
   - `user_id` (str), `filename` (str), `topic_name` (str), `summary` (str), `content` (str)
 - NVIDIA and Gemini calls use a simple key rotator. Provide one or more keys via `NVIDIA_API_1..5`, `GEMINI_API_1..5`.
 - The app is defensive: if embeddings or summarization models are unavailable, it falls back to naive strategies to keep the app responsive (with reduced quality).

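The key rotator can be as simple as round-robin over whichever `NVIDIA_API_1..5` / `GEMINI_API_1..5` variables are set; a sketch under that assumption, not the app's actual implementation:

```python
import itertools
import os

def load_keys(prefix: str, max_slots: int = 5) -> list[str]:
    """Collect PREFIX_1..PREFIX_5 from the environment, skipping unset slots."""
    keys = [os.getenv(f"{prefix}_{i}") for i in range(1, max_slots + 1)]
    return [k for k in keys if k]

class KeyRotator:
    """Round-robin over the configured API keys."""

    def __init__(self, keys: list[str]):
        if not keys:
            raise ValueError("no API keys configured")
        self._cycle = itertools.cycle(keys)

    def next_key(self) -> str:
        return next(self._cycle)
```

A provider call would grab `rotator.next_key()` per request, so quota pressure spreads across keys.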
+### Logging and Observability

 - Logs are tagged by module via `utils/logger.py`:
   - [APP] app lifecycle, ingestion, chat flow
   - [CHUNKER]/[SUM]/[COMMON]/[PARSER] module-specific messages
 - Change verbosity by setting the root logger level in code if needed

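A tagged stdout logger in the spirit of `utils/logger.py` can be built with the standard `logging` module (a sketch; the real module may differ):

```python
import logging
import sys

def get_tagged_logger(tag: str) -> logging.Logger:
    """Return an INFO-level stdout logger whose lines are prefixed with [TAG]."""
    logger = logging.getLogger(tag)
    if not logger.handlers:  # avoid stacking handlers on repeated imports
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter(f"[{tag}] %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

log = get_tagged_logger("RAG")
log.info("vector search returned 6 chunks")  # prints: [RAG] INFO vector search returned 6 chunks
```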
+### Performance and Cost Tips

 - Disable image captioning if CPU-bound by short-circuiting in `utils/caption.py` (return "")
 - Use smaller `k` in `/chat` for fewer chunks
 - If Atlas Vector is unavailable, local cosine search samples up to 2000 docs; tune in `utils/rag.py`
 - Run with `--workers` and consider a process manager for production

+### Retriever Functionality
+
+- Filename detection: a regex captures tokens ending with `.pdf|.docx|.doc` in the user question; preceding prose is not captured.
+- Relevance: NVIDIA classifies files by relevance to the question; any explicitly mentioned filenames are force-included.
+- Retrieval: vector search runs over the relevant files; on empty hits, it retries with mentions only, then with all files.
+- Fallback: if retrieval yields no chunks but file summaries exist, the app returns a composed summary response.
+- Guardrails: responses are instructed to answer only from the provided context and to admit when the answer is unknown.
+- "I don't know…" often means no chunks were retrieved:
+  - Verify ingestion finished: `GET /upload/status`
+  - Confirm files exist: `GET /files`
+  - Try `GET /file-summary` to ensure summaries exist
+  - Check logs around `[APP] [CHAT]` for relevance, retries, and fallbacks
+- NVIDIA/Gemini API: ensure keys are set (`NVIDIA_API_1..`, `GEMINI_API_1..`). See `[ROUTER]`/`[ROTATOR]` logs.
+- Atlas Vector: set `ATLAS_VECTOR=1` and ensure the index exists; otherwise the local cosine fallback is used.
+- Performance: disable BLIP captions in `utils/caption.py` if CPU-bound; reduce `k` in `/chat`.
+
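The filename detection described above can be reproduced with a small regex. This pattern is an assumption matching the documented behavior (tokens ending in `.pdf`/`.docx`/`.doc`), not necessarily the app's exact expression:

```python
import re

# Tokens that end in .pdf/.docx/.doc; surrounding prose is not captured.
FILENAME_RE = re.compile(r"\b[\w\-.]+\.(?:docx|pdf|doc)\b", re.IGNORECASE)

def detect_filenames(question: str) -> list[str]:
    """Return filenames mentioned in the question, in order, without duplicates."""
    out: list[str] = []
    for m in FILENAME_RE.finditer(question):
        name = m.group(0)
        if name not in out:
            out.append(name)
    return out
```

Detected names would then be force-included in the retrieval scope regardless of the classifier's verdict.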
 ## Security Notes

 - CORS is currently open (`allow_origins=["*"]`) for simplicity. Restrict in production
 - Validate and limit upload sizes at the reverse proxy (e.g., nginx) or add checks in `/upload`
 - Secrets are passed via environment; avoid committing them

 ## Troubleshooting

 - Missing Python packages: install via `pip install -r requirements.txt`.

 ## License

+**Apache-2.0**