frkhan committed
Commit 9536c67 · 1 Parent(s): b3c07b5

- Added docstring for the whole project

Dockerfile CHANGED
@@ -8,7 +8,6 @@ COPY requirements.txt .
  RUN pip install --break-system-packages -r requirements.txt
  RUN python -m playwright install --with-deps chromium
 
- # RUN pip install watchfiles
 
  COPY . .
 
README.md CHANGED
@@ -7,15 +7,159 @@ sdk: docker
  app_port: 7860
  ---
 
- ## LLM Web Scraper
-
- This application uses Docker for deployment on Hugging Face Spaces.
-
- It combines web scraping with the power of Large Language Models (LLMs) to extract specific information from web pages.
-
- ### How to Use
- 1. **Enter a URL**: Provide the URL of the web page you want to analyze.
- 2. **Define Your Query**: Specify the exact information you're looking for.
- 3. **Scrape the Web Page**: Choose a scraper and extract the page content.
- 4. **Select Model & Provider**: Choose an LLM for information extraction.
- 5. **Extract Info by LLM**: Get a structured answer based on your query.
+ # LLM Web Scraper (🕸️ → 🤖 → 🧠 → ❓ → 📄)
+
+ Scrape any web page, ask questions, and get structured answers powered by LangChain, FireCrawl, and leading LLMs from NVIDIA and Google—all wrapped in a clean Gradio interface.
+
+ 🔗 **Live Demo**: https://huggingface.co/spaces/frkhan/llm-web-scrapper
+
+ ---
+
+ ### 🚀 Features
+
+ - 🕸️ **Multi-Backend Scraping**: Choose between `FireCrawl` for robust, API-driven scraping and `Crawl4AI` for local, Playwright-based scraping.
+ - 🧠 **Intelligent Extraction**: Use powerful LLMs (NVIDIA or Google Gemini) to understand your query and extract specific information from scraped content.
+ - 📊 **Structured Output**: Get answers in markdown tables, JSON, or plain text, as requested.
+ - 📈 **Full Observability**: Integrated with `Langfuse` to trace both scraping and LLM-extraction steps.
+ - ✨ **Interactive UI**: A clean and simple interface built with `Gradio`.
+ - 🐳 **Docker-Ready**: Comes with `Dockerfile` and `docker-compose` configurations for easy local and production deployment.
+
+ ---
+
+ ### 🛠️ Tech Stack
+
+ | Component | Purpose |
+ | :--- | :--- |
+ | **LangChain** | Orchestration of LLM calls |
+ | **FireCrawl / Crawl4AI** | Web scraping backends |
+ | **NVIDIA / Gemini** | LLM APIs for information extraction |
+ | **Langfuse** | Tracing and observability for all operations |
+ | **Gradio** | Interactive web UI |
+ | **Docker** | Containerized deployment |
+
+ ---
+
+ ## 📦 Installation
+
+ ### Option 1: Run Locally
+
+ 1. **Clone the repository:**
+ ```bash
+ git clone https://github.com/KI-IAN/llm-web-scrapper.git
+ cd llm-web-scrapper
+ ```
+
+ 2. **Install dependencies:**
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 3. **Install Playwright browsers (for Crawl4AI):**
+ ```bash
+ playwright install
+ ```
+
+ 4. **Create a `.env` file** in the root directory with your API keys:
+ ```env
+ GOOGLE_API_KEY=your_google_api_key
+ NVIDIA_API_KEY=your_nvidia_api_key
+ FIRECRAWL_API_KEY=your_firecrawl_api_key
+
+ # Optional: For Langfuse tracing
+ LANGFUSE_PUBLIC_KEY=pk-lf-...
+ LANGFUSE_SECRET_KEY=sk-lf-...
+ LANGFUSE_HOST=https://cloud.langfuse.com
+ ```
+
+ 5. **Run the application:**
+ ```bash
+ python app.py
+ ```
+
+ ---
+
+ ### Option 2: Run with Docker
+
+ 1. **For Production:**
+ This uses the standard `docker-compose.yml`.
+ ```bash
+ docker compose up --build
+ ```
+
+ 2. **For Local Development (with live code reload):**
+ This uses `docker-compose.dev.yml` to mount your local code into the container.
+ ```bash
+ docker compose -f docker-compose.dev.yml up --build
+ ```
+
+ Access the app at http://localhost:12200.
+
+ ---
+
+ ## 🔑 Getting API Keys
+
+ To use this app, you'll need API keys for **Google Gemini**, **NVIDIA NIM**, and **FireCrawl**. For full observability, you'll also need keys for **Langfuse**.
+
+ - **Google Gemini API Key**:
+ 1. Visit the Google AI Studio.
+ 2. Click **"Create API Key"** and copy the key.
+
+ - **NVIDIA NIM API Key**:
+ 1. Go to the NVIDIA API Catalog.
+ 2. Choose a model, go to the "API" tab, and click **"Get API Key"**.
+
+ - **FireCrawl API Key**:
+ 1. Sign up at FireCrawl.dev.
+ 2. Find your API key in the dashboard.
+
+ - **Langfuse API Keys (Optional)**:
+ 1. Sign up or log in at [Langfuse Cloud](https://cloud.langfuse.com/).
+ 2. Navigate to your project settings and then to the "API Keys" tab.
+ 3. Create a new key pair to get your `LANGFUSE_PUBLIC_KEY` (starts with `pk-lf-...`) and `LANGFUSE_SECRET_KEY` (starts with `sk-lf-...`).
+ 4. Add these to your `.env` file to enable tracing.
+
+ ---
+
+ ## 🧪 How to Use
+
+ 1. **Enter a URL**: Provide the URL of the web page you want to analyze.
+ 2. **Define Your Query**: Specify what you want to extract (e.g., "product name, price, and rating" or "summarize this article").
+ 3. **Scrape the Web Page**: Choose a scraper (`Crawl4AI` or `FireCrawl`) and click **"Scrape Website"**.
+ 4. **Select Model & Provider**: Choose an LLM to process the scraped content.
+ 5. **Extract Info**: Click **"Extract Info by LLM"** to get a structured answer.
+
+ ---
+
+ ### 📁 File Structure
+
+ ```
+ llm-web-scrapper/
+ ├── .env                      # Local environment variables (not tracked by git)
+ ├── .github/                  # GitHub Actions workflows
+ ├── .gitignore
+ ├── docker-compose.yml        # Production Docker configuration
+ ├── docker-compose.dev.yml    # Development Docker configuration
+ ├── Dockerfile
+ ├── requirements.txt
+ ├── app.py                    # Gradio UI and application logic
+ ├── config.py                 # Environment variable loading
+ ├── crawl4ai_client.py        # Client for Crawl4AI scraper
+ ├── firecrawl_client.py       # Client for FireCrawl scraper
+ └── llm_inference_service.py  # Logic for LLM calls
+ ```
+
+ ---
+
+ ## 📜 License
+
+ This project is open-source and distributed under the **MIT License**. Feel free to use, modify, and distribute it.
+
+ ---
+
+ ## 🤝 Acknowledgements
+
+ - LangChain for orchestrating LLM interactions.
+ - FireCrawl & Crawl4AI for providing powerful scraping backends.
+ - NVIDIA AI Endpoints & Google Gemini for their state-of-the-art LLMs.
+ - Langfuse for providing excellent observability tools.
+ - Gradio for making UI creation simple and elegant.
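Editor's note: the "How to Use" steps in the README above can also be driven from Python instead of the Gradio UI. The sketch below is built only from module and function names that appear in this commit (`crawl4ai_client.scrape_and_get_markdown_with_crawl4ai`, `llm_inference_service.extract_page_info_by_llm`); the URL, query, and model/provider values are placeholders, not project defaults.

```python
import asyncio

import crawl4ai_client
import llm_inference_service


async def main() -> None:
    # Placeholder URL and query; substitute your own.
    url = "https://example.com/products"
    query = "Find product name, price, and rating"

    # Steps 1-3 of "How to Use": scrape the page to markdown (Crawl4AI backend).
    markdown = await crawl4ai_client.scrape_and_get_markdown_with_crawl4ai(url)

    # Steps 4-5: extract structured info with an LLM. The model name/provider
    # below are placeholders; pick whichever pair the UI dropdown exposes.
    answer = llm_inference_service.extract_page_info_by_llm(
        user_query=query,
        scraped_markdown_content=markdown,
        model_name="gemini-2.0-flash",
        model_provider="google_genai",
    )
    print(answer)


if __name__ == "__main__":
    asyncio.run(main())
```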
app.py CHANGED
@@ -1,3 +1,11 @@
+ """
+ This module sets up and runs the Gradio web interface for the LLM Web Scraper application.
+
+ It orchestrates the UI components, event handling for scraping and LLM extraction,
+ and integrates with backend services for scraping (FireCrawl, Crawl4AI) and
+ LLM inference. It also initializes and uses Langfuse for tracing application performance.
+ """
+
  import gradio as gr
  import firecrawl_client
  import crawl4ai_client
@@ -16,7 +24,20 @@ if LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY:
  langfuse = get_client()
 
  def parse_model_provider(selection):
- # Expected format: "<model_name> (<provider>)"
+ """
+ Parses a model and provider from a selection string.
+
+ The expected format is "<model_name> (<provider>)".
+
+ Args:
+ selection (str): The string to parse.
+
+ Returns:
+ tuple[str, str]: A tuple containing the model name and provider.
+
+ Raises:
+ ValueError: If the selection string is not in the expected format.
+ """
  if "(" in selection and ")" in selection:
  model = selection.split(" (")[0].strip()
  provider = selection.split(" (")[1].replace(")", "").strip()
@@ -24,6 +45,21 @@ def parse_model_provider(selection):
  raise ValueError(f"Invalid selection format: {selection}")
 
  def llm_response_wrapper(query, scrape_result, model_provider_selection, progress=gr.Progress(track_tqdm=True)):
+ """
+ A generator function that wraps the LLM inference call for the Gradio UI.
+
+ It yields an initial status message, calls the LLM service to extract information,
+ and then yields the final result or an error message.
+
+ Args:
+ query (str): The user's query for information extraction.
+ scrape_result (str): The scraped markdown content from the website.
+ model_provider_selection (str): The selected model and provider string.
+ progress (gr.Progress, optional): Gradio progress tracker. Defaults to gr.Progress(track_tqdm=True).
+
+ Yields:
+ str: Status messages and the final LLM response as a markdown string.
+ """
  yield "⏳ Generating response... Please wait."
 
  model, provider = parse_model_provider(model_provider_selection)
@@ -33,9 +69,19 @@ def llm_response_wrapper(query, scrape_result, model_provider_selection, progress=gr.Progress(track_tqdm=True)):
  yield result
 
  async def scrape_website(url, scraper_selection, progress=gr.Progress(track_tqdm=True)):
- """
- Performs the scraping and yields Gradio component updates directly.
- This generator pattern is the most reliable way to handle sequential UI updates.
+ """An async generator that scrapes a website based on user selection for the Gradio UI.
+
+ This function yields an initial status message, then performs the web scraping
+ using the selected tool (FireCrawl or Crawl4AI). If Langfuse is configured,
+ it wraps the scraping operation in a trace for observability.
+
+ Args:
+ url (str): The URL of the website to scrape.
+ scraper_selection (str): The scraping tool selected by the user.
+ progress (gr.Progress, optional): Gradio progress tracker. Defaults to gr.Progress(track_tqdm=True).
+
+ Yields:
+ str: A status message, followed by the scraped markdown content or an error message.
  """
  # 1. First, yield an update to show the loading state and hide the old image.
  yield "⏳ Scraping website... Please wait."
@@ -69,6 +115,7 @@ async def scrape_website(url, scraper_selection, progress=gr.Progress(track_tqdm=True)):
  yield markdown
 
  #Gradio UI
+ # This block defines the entire Gradio user interface, including layout and component interactions.
  with gr.Blocks() as gradio_ui:
  gr.HTML("""
  <div style="display: flex; align-items: center; gap: 20px; flex-wrap: wrap; margin-bottom: 20px;">
@@ -127,14 +174,13 @@ with gr.Blocks() as gradio_ui:
 
  with gr.Column():
  url_input = gr.Textbox(label="Enter URL to scrape", placeholder="https://example.com/query?search=cat+food", lines=1)
- # search_query_input = gr.Textbox(label="Enter your query", placeholder="Paw paw fish adult cat food", lines=1)
- query_input = gr.Textbox(label="What information do you want to find?", placeholder="Find product name, price, rating", lines=1)
+ query_input = gr.Textbox(label="What information do you want to find?", placeholder="Find product name, price, rating etc. / Summarize the content of this page", lines=2)
 
  with gr.Row():
  scraper_dropdown = gr.Dropdown(
  label="Select Scraper",
- choices=["Scrape with FireCrawl", "Scrape with Crawl4AI"],
- value="Scrape with FireCrawl"
+ choices=["Scrape with Crawl4AI", "Scrape with FireCrawl"],
+ value="Scrape with Crawl4AI"
  )
  scrape_btn = gr.Button("Scrape Website")
  clear_btn = gr.Button("Clear")
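Editor's note: the new docstring documents the "<model_name> (<provider>)" selection format. Below is a minimal standalone sketch that mirrors the parsing logic visible in the hunk above; the final `return` statement is assumed (the hunk ends before it), and the selection strings are made-up examples rather than the UI's actual dropdown entries.

```python
def parse_model_provider(selection: str) -> tuple[str, str]:
    # Mirrors the logic shown in app.py: split "<model_name> (<provider>)" into its parts.
    if "(" in selection and ")" in selection:
        model = selection.split(" (")[0].strip()
        provider = selection.split(" (")[1].replace(")", "").strip()
        return model, provider  # assumed return, not shown in the hunk
    raise ValueError(f"Invalid selection format: {selection}")


print(parse_model_provider("some-model (nvidia)"))  # ('some-model', 'nvidia')

try:
    parse_model_provider("missing-provider")
except ValueError as exc:
    print(exc)  # Invalid selection format: missing-provider
```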
crawl4ai_client.py CHANGED
@@ -1,8 +1,24 @@
- import asyncio
+ """
+ This module provides a client for interacting with the Crawl4AI library.
+
+ It encapsulates the logic for scraping a website using Crawl4AI and extracting
+ its content as a markdown string, handling potential errors during the process.
+ """
+
  from crawl4ai import AsyncWebCrawler
 
 
  async def scrape_and_get_markdown_with_crawl4ai(url: str) -> str:
+ """
+ Asynchronously scrapes a given URL using Crawl4AI and returns its content as markdown.
+
+ Args:
+ url (str): The URL of the website to scrape.
+
+ Returns:
+ str: The scraped content in markdown format. If scraping fails or returns
+ no content, a formatted error message string is returned.
+ """
  try:
  async with AsyncWebCrawler() as crawler:
  result = await crawler.arun(url=url)
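Editor's note: the commit drops the now-unused `import asyncio` from the module itself, but callers still need an event loop to run this coroutine. A minimal usage sketch (the URL is a placeholder):

```python
import asyncio

from crawl4ai_client import scrape_and_get_markdown_with_crawl4ai

# Placeholder URL; per the new docstring, the call returns markdown on success
# or a formatted error message string on failure.
markdown = asyncio.run(scrape_and_get_markdown_with_crawl4ai("https://example.com"))
print(markdown[:500])
```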
docker-compose.dev.yml CHANGED
@@ -12,9 +12,9 @@ services:
  - NVIDIA_API_KEY=${NVIDIA_API_KEY} # Load this key from .env in local/dev environment
  - GOOGLE_API_KEY=${GOOGLE_API_KEY} # Load this key from .env in local/dev environment
  - FIRECRAWL_API_KEY=${FIRECRAWL_API_KEY} # Load this key from .env in local/dev environment
- - LANGFUSE_PUBLIC_KEY=${LANGFUSE_PUBLIC_KEY}
- - LANGFUSE_SECRET_KEY=${LANGFUSE_SECRET_KEY}
- - LANGFUSE_HOST=${LANGFUSE_HOST}
+ - LANGFUSE_PUBLIC_KEY=${LANGFUSE_PUBLIC_KEY} # Load this key from .env in local/dev environment
+ - LANGFUSE_SECRET_KEY=${LANGFUSE_SECRET_KEY} # Load this key from .env in local/dev environment
+ - LANGFUSE_HOST=${LANGFUSE_HOST} # Load this key from .env in local/dev environment
  volumes:
  - .:/app:rw # This is for local development. Docker reads the code from the host machine. Changes on the host are reflected in the container.
  restart: unless-stopped
docker-compose.yml CHANGED
@@ -12,7 +12,7 @@ services:
  - NVIDIA_API_KEY=${NVIDIA_API_KEY} # Load this key from .env or manually add the secret
  - GOOGLE_API_KEY=${GOOGLE_API_KEY} # Load this key from .env or manually add the secret
  - FIRECRAWL_API_KEY=${FIRECRAWL_API_KEY} # Load this key from .env in local/dev environment
- - LANGFUSE_PUBLIC_KEY=${LANGFUSE_PUBLIC_KEY}
- - LANGFUSE_SECRET_KEY=${LANGFUSE_SECRET_KEY}
- - LANGFUSE_HOST=${LANGFUSE_HOST}
+ - LANGFUSE_PUBLIC_KEY=${LANGFUSE_PUBLIC_KEY} # Load this key from .env or manually add the secret
+ - LANGFUSE_SECRET_KEY=${LANGFUSE_SECRET_KEY} # Load this key from .env or manually add the secret
+ - LANGFUSE_HOST=${LANGFUSE_HOST} # Load this key from .env or manually add the secret
  restart: unless-stopped
firecrawl_client.py CHANGED
@@ -1,10 +1,27 @@
+ """
+ This module provides a client for interacting with the FireCrawl service.
+
+ It encapsulates the logic for scraping a website using the FireCrawlLoader from
+ LangChain, converting the scraped documents into a single markdown string, and
+ handling potential errors during the process.
+ """
+
  from langchain_community.document_loaders import FireCrawlLoader
  from langchain_core.documents import Document
  from config import FIRE_CRAWL_API_KEY
 
 
  def scrape_with_firecrawl(url: str) -> list[Document]:
-
+ """
+ Scrapes a given URL using FireCrawl and returns the content as a list of Documents.
+
+ Args:
+ url (str): The URL of the website to scrape.
+
+ Returns:
+ list[Document]: A list of LangChain Document objects, where each document
+ represents a scraped page.
+ """
  loader = FireCrawlLoader(url=url,
  api_key=FIRE_CRAWL_API_KEY,
  mode='scrape')
@@ -17,6 +34,17 @@ def scrape_with_firecrawl(url: str) -> list[Document]:
  return pages
 
  def get_markdown_from_documents(docs: list[Document]) -> str:
+ """
+ Converts a list of LangChain Documents into a single markdown string.
+
+ Each document's content is appended, separated by a horizontal rule.
+
+ Args:
+ docs (list[Document]): A list of Document objects to process.
+
+ Returns:
+ str: A string containing the combined content in markdown format.
+ """
  markdown_content = ""
  for i, doc in enumerate(docs):
  markdown_content += f"### Page {i+1}\n"
@@ -25,6 +53,18 @@ def get_markdown_from_documents(docs: list[Document]) -> str:
 
 
  def scrape_and_get_markdown_with_firecrawl(url: str) -> str:
+ """
+ Orchestrates the scraping of a URL with FireCrawl and returns the content as markdown.
+
+ This is the main entry point function for this module. It handles the full
+ process of scraping, content conversion, and error handling.
+
+ Args:
+ url (str): The URL of the website to scrape.
+
+ Returns:
+ str: The scraped content in markdown format, or a formatted error message string if an issue occurs.
+ """
  try:
  docs = scrape_with_firecrawl(url)
  if not docs:
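Editor's note: a corresponding usage sketch for the FireCrawl path. It assumes the FireCrawl API key is available via the project's `config` module (loaded there as `FIRE_CRAWL_API_KEY`), and the URL is a placeholder.

```python
from firecrawl_client import scrape_and_get_markdown_with_firecrawl

# Synchronous call, unlike the Crawl4AI client; per the new docstring it returns
# markdown on success or a formatted error message string on failure.
markdown = scrape_and_get_markdown_with_firecrawl("https://example.com")
print(markdown[:500])
```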
llm_inference_service.py CHANGED
@@ -1,3 +1,11 @@
+ """
+ This module provides the service for interacting with Large Language Models (LLMs).
+
+ It is responsible for initializing the Langfuse callback handler for tracing,
+ constructing the appropriate prompt for information extraction, initializing the
+ selected chat model, and invoking the model to get a response.
+ """
+
  from langchain.chat_models import init_chat_model
  from langfuse.langchain import CallbackHandler
  from langfuse import Langfuse
@@ -5,24 +13,41 @@ from langfuse import Langfuse
  from config import LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST
 
  # Initialize Langfuse client
- # It is safe to do this even if keys are not set, as the handler will only be used if keys are present.
+ # This block sets up the Langfuse callback handler for LangChain.
+ # It initializes the Langfuse client and creates a CallbackHandler instance
+ # only if the required API keys are available. The handler is then added to
+ # a list of callbacks that can be passed to LLM invocations for tracing.
  langfuse_callback_handler = None
  callbacks = []
 
  if LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY:
- langfuse = Langfuse(
+ Langfuse(
  public_key=LANGFUSE_PUBLIC_KEY,
  secret_key=LANGFUSE_SECRET_KEY,
  host=LANGFUSE_HOST,
  )
-
  langfuse_callback_handler = CallbackHandler()
-
  callbacks.append(langfuse_callback_handler)
 
 
 
  def extract_page_info_by_llm(user_query: str, scraped_markdown_content: str, model_name: str, model_provider: str) -> str:
+ """
+ Extracts information from scraped content using a specified Large Language Model.
+
+ This function constructs a detailed prompt, initializes the selected chat model,
+ and invokes it with the scraped content and user query. If Langfuse is configured,
+ it uses a callback handler to trace the LLM interaction.
+
+ Args:
+ user_query (str): The user's query specifying what information to extract.
+ scraped_markdown_content (str): The markdown content from the scraped web page.
+ model_name (str): The name of the LLM to use for extraction.
+ model_provider (str): The provider of the LLM (e.g., 'google_genai', 'nvidia').
+
+ Returns:
+ str: The content of the LLM's response.
+ """
 
  if not scraped_markdown_content:
  return "No relevant information found to answer your question."