WebIQ_backend / README.md
schandel08's picture
Backend added
9f84bcd
metadata
title: WebIQ Backend
emoji: 🌍
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

WebIQ – Boosts your web intelligence with AI-powered insights

Overview

WebIQ is a powerful web scraping and question-answering (QA) chatbot that follows the Retrieval-Augmented Generation (RAG) pipeline. It extracts and retrieves key insights from any website and generates AI-powered responses based on the extracted data. WebIQ leverages FAISS for efficient similarity search, LangChain for retrieval orchestration, and state-of-the-art LLMs for response generation.

Features

  • Automated Web Scraping: Extracts text data from webpages, caches it locally, and supports both targeted and full-site scraping.
  • Vector Embeddings: Uses FAISS to store and retrieve information efficiently.
  • LLM Integration: Supports OpenAI (GPT-4) and Hugging Face (Llama-2, Mistral, etc.).
  • Chunking for Optimization: Splits documents into meaningful chunks to enhance retrieval quality.
  • Asynchronous Processing: Uses asyncio for efficient execution.
  • Caching Mechanism: Ensures previously processed webpages are not reprocessed.
  • Batch Processing: Processes large numbers of URLs efficiently.
  • Memory Usage Logging: Tracks memory consumption before and after each batch for efficiency monitoring.
  • Multi-Page Scraping: Seamlessly scrapes content from multiple webpages and aggregates insights.

Installation

  1. Clone the repository:

    git clone https://github.com/Siddharth-Chandel/WebIQ.git
    cd WebIQ
    
  2. Create a virtual environment and activate it:

    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
    
  3. Install the required dependencies:

    pip install -r requirements.txt
    
  4. Set up environment variables by creating a .env file:

    HUGGINGFACEHUB_API_TOKEN=your_huggingface_token
    OPENAI_API_KEY=your_openai_api_key  # If using OpenAI
    

Usage

  1. Run the chatbot script:
    python chatbot.py
    
  2. Enter a URL when prompted (e.g., https://playwright.dev).
  3. Enter your query (e.g., Describe Playwright and its benefits).
  4. The chatbot will scrape the webpage, process the data, and return an AI-generated response.

Example Output

====================* Answer *====================
Playwright is an end-to-end testing framework that provides...

=================* Source Documents *=================
Source 1:
file: cache/playwright-dev/pages/page_1.txt
Content: Playwright is a Node.js library that automates browsers.

Practical Use Cases

  • Research Assistance: Quickly extract and summarize information from research papers, blogs, or documentation.
  • Competitive Analysis: Monitor competitors' websites and extract relevant insights for business strategy.
  • Customer Support: Enhance chatbot capabilities by integrating real-time website data retrieval.
  • Market Intelligence: Gather structured data from news sites, product pages, or financial reports for analysis.
  • SEO Optimization: Analyze webpage content for better keyword targeting and content strategy.

Technologies Used

  • RAG (Providing better context)
  • LangChain (Retrieval-based QA system)
  • FAISS (Efficient similarity search)
  • Hugging Face Transformers (LLMs & embeddings)
  • OpenAI GPT-4 (Optional for LLM-based response generation)
  • Crawl4AI (An LLM-based web-scraper)
  • AsyncIO (Increment the processing speed)
  • Rich (For colorful CLI outputs)

Future Enhancements

  • Develop an interactive web UI using Streamlit or FastAPI for a seamless user experience.
  • Enhance retrieval quality with advanced RAG tuning and improved embeddings.

License

This project is licensed under the MIT License.

Author

Siddharth Chandel - Developed as part of NLP & AI research. Let's connect on LinkedIn !!!


Contributions are welcome! Feel free to fork and enhance. πŸš€