
title: WebIQ Backend
emoji: 🌍
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

WebIQ – Boosts your web intelligence with AI-powered insights

Overview

WebIQ is a web scraping and question-answering (QA) chatbot built on a Retrieval-Augmented Generation (RAG) pipeline. It scrapes a given website, retrieves the passages most relevant to a question, and generates a response grounded in that extracted text. WebIQ uses FAISS for similarity search, LangChain for retrieval orchestration, and an LLM (OpenAI or Hugging Face) for response generation.
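
The chunk, retrieve, and prompt steps of a RAG pipeline can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not WebIQ's implementation: `chunk_text`, `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical names, and a bag-of-words cosine similarity stands in for the real embedding model and the FAISS index.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (assumption: stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the text that would be sent to the LLM."""
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


if __name__ == "__main__":
    pages = [
        "Playwright is a Node.js library that automates browsers.",
        "The weather in Paris is mild in spring.",
    ]
    chunks = [c for page in pages for c in chunk_text(page)]
    question = "What does Playwright automate?"
    print(build_prompt(question, retrieve(question, chunks, k=1)))
```

In the real pipeline the ranking step would be delegated to a FAISS index over dense embeddings, and the prompt would be passed to the configured LLM; only the overall shape of the flow is shown here.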

Features

Automated Web Scraping: Extracts text data from webpages, caches it locally, and supports both targeted and full-site scraping.
Vector Embeddings: Stores chunk embeddings in a FAISS index and retrieves the most similar chunks at query time.
LLM Integration: Supports OpenAI (GPT-4) and Hugging Face (Llama-2, Mistral, etc.).
Chunking for Optimization: Splits documents into meaningful chunks to enhance retrieval quality.
Asynchronous Processing: Uses asyncio for efficient execution.
Caching Mechanism: Ensures previously processed webpages are not reprocessed.
Batch Processing: Processes large numbers of URLs efficiently.
Memory Usage Logging: Tracks memory consumption before and after each batch for efficiency monitoring.
Multi-Page Scraping: Seamlessly scrapes content from multiple webpages and aggregates insights.
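
The caching, batch processing, and memory logging features listed above could fit together roughly as follows. This is a sketch under assumptions, not the project's code: `cache_path`, `fake_scrape`, `fetch_cached`, and `process_in_batches` are hypothetical names, the hashed file naming is invented for illustration, and `tracemalloc` stands in for whatever memory measurement WebIQ actually uses.

```python
import asyncio
import hashlib
import tempfile
import tracemalloc
from pathlib import Path


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a stable file name inside the cache directory."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{digest}.txt"


async def fake_scrape(url: str) -> str:
    """Placeholder for the real scraper (hypothetical); returns dummy text."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"content of {url}"


async def fetch_cached(cache_dir: Path, url: str, scrape=fake_scrape) -> str:
    """Return cached text if present; otherwise scrape and store it."""
    path = cache_path(cache_dir, url)
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = await scrape(url)
    path.write_text(text, encoding="utf-8")
    return text


async def process_in_batches(cache_dir: Path, urls: list[str],
                             batch_size: int = 2, scrape=fake_scrape) -> list[str]:
    """Process URLs batch by batch, logging traced memory around each batch."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    results: list[str] = []
    tracemalloc.start()
    try:
        for start in range(0, len(urls), batch_size):
            batch = urls[start:start + batch_size]
            before, _ = tracemalloc.get_traced_memory()
            # URLs within one batch are fetched concurrently.
            results += await asyncio.gather(
                *(fetch_cached(cache_dir, u, scrape) for u in batch)
            )
            after, _ = tracemalloc.get_traced_memory()
            print(f"batch {start // batch_size}: {before} -> {after} bytes traced")
    finally:
        tracemalloc.stop()
    return results


if __name__ == "__main__":
    demo_urls = ["https://playwright.dev", "https://example.com"]
    with tempfile.TemporaryDirectory() as tmp:
        pages = asyncio.run(process_in_batches(Path(tmp), demo_urls))
        print(len(pages), "pages processed")
```

Running the same URL list a second time against the same cache directory reads every page from disk, so the scraper is not called again. Note that `tracemalloc` only counts Python-level allocations; process-wide figures would need a different tool.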

Installation

Clone the repository:

git clone https://github.com/Siddharth-Chandel/WebIQ.git
cd WebIQ

Create a virtual environment and activate it:

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Install the required dependencies:
```
pip install -r requirements.txt
```

Set up environment variables by creating a .env file:

HUGGINGFACEHUB_API_TOKEN=your_huggingface_token
OPENAI_API_KEY=your_openai_api_key  # If using OpenAI
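
For reference, a `.env` file in this format can be read with a few lines of standard-library Python. The sketch below is illustrative only: `load_env_file` is a hypothetical helper, and the README does not say how WebIQ loads these values (many projects use the `python-dotenv` package instead).

```python
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=value lines from a .env file, skipping blanks and comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        # Drop trailing inline comments such as "  # If using OpenAI".
        line = raw.split(" #", 1)[0].strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


if __name__ == "__main__":
    env = load_env_file()
    # Print only the variable names so secrets never reach the terminal.
    print(sorted(env))
```

This simple parser treats any ` #` sequence as the start of a comment, so it would not suit values that legitimately contain that sequence.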

Usage

Run the chatbot script:
```
python chatbot.py
```
Enter a URL when prompted (e.g., https://playwright.dev).
Enter your query (e.g., Describe Playwright and its benefits).
The chatbot will scrape the webpage, process the data, and return an AI-generated response.

Example Output

====================* Answer *====================
Playwright is an end-to-end testing framework that provides...

=================* Source Documents *=================
Source 1:
file: cache/playwright-dev/pages/page_1.txt
Content: Playwright is a Node.js library that automates browsers.

Practical Use Cases

Research Assistance: Quickly extract and summarize information from research papers, blogs, or documentation.
Competitive Analysis: Monitor competitors' websites and extract relevant insights for business strategy.
Customer Support: Enhance chatbot capabilities by integrating real-time website data retrieval.
Market Intelligence: Gather structured data from news sites, product pages, or financial reports for analysis.
SEO Optimization: Analyze webpage content for better keyword targeting and content strategy.

Technologies Used

RAG (Grounds responses in retrieved context)
LangChain (Retrieval-based QA system)
FAISS (Efficient similarity search)
Hugging Face Transformers (LLMs & embeddings)
OpenAI GPT-4 (Optional for LLM-based response generation)
Crawl4AI (An LLM-friendly web crawler and scraper)
asyncio (Concurrent execution to speed up processing)
Rich (For colorful CLI outputs)

Future Enhancements

Develop an interactive web UI, for example a Streamlit app or a frontend served by FastAPI, for a seamless user experience.
Enhance retrieval quality with advanced RAG tuning and improved embeddings.

License

This project is licensed under the MIT License.

Author

Siddharth Chandel - Developed as part of NLP & AI research. Let's connect on LinkedIn!

Contributions are welcome! Feel free to fork and enhance. 🚀