Space: Hopsakee/fabric_to_espanso

Hopsakee committed on Mar 2

Commit 5b40ec9 (verified) · 1 parent: fdaaf27

Upload folder using huggingface_hub

Files changed (8)

.gitattributes +1 -0
README.md +138 -62
data/Fab2Esp_transparent.png +0 -0
parameters.py +3 -2
src/fabrics_processor/config.py +10 -9
src/fabrics_processor/database.py +11 -9
src/fabrics_processor/database_updater.py +32 -27
src/search_qdrant/streamlit_app.py +9 -9

.gitattributes

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+data/Fab2Esp_transparent.png filter=lfs diff=lfs merge=lfs -text

README.md

@@ -6,98 +6,174 @@ sdk_version: 5.12.0
 ---
 # Fabric to Espanso Converter
-A Python application that bridges Fabric prompts with Espanso by managing and converting prompts through a vector database.
 ## Features
-- Store and manage Fabric prompts in a Qdrant vector database
-- Convert stored prompts into Espanso YAML format for system-wide usage
-- Semantic search functionality to find relevant prompts based on their meaning
-- Web interface for easy interaction with the prompt database
 ## Prerequisites
-- Python 3.11
-- Qdrant vector database server (local or cloud)
-- Obsidian with MeshAI plugin installed
-- Windows (for PowerShell script) or Linux/WSL for direct execution
 ## Installation
-1. Install Obsidian and the MeshAI plugin
-2. In Obsidian, create the following folder structure:
-   ```
-   Extra/
-   └── FabricPatterns/
-       ├── Official/  # For downloaded Fabric patterns
-       └── Own/       # For your custom additions
-   ```
-3. Clone this repository
-4. Install dependencies using PDM:
    ```bash
    pdm install
    ```
-5. Configure your Qdrant server connection in the application settings
 ## Usage
-### Linux/WSL
-Run the Streamlit application directly:
 ```bash
-./src/search_qdrant/run_streamlit.sh
 ```
-### Windows
-Create a PowerShell script with the following content to start the application:
 ```powershell
-# Start WSL process without showing window
-$startInfo = New-Object System.Diagnostics.ProcessStartInfo
-$startInfo.Filename = "wsl.exe"
-# Use -c flag to let the command use the WSL2 Ubuntu folder system and not the Windows
-$startInfo.Arguments = "bash -c ~/Tools/pythagora-core/workspace/fabrics_processor/src/search_qdrant/run_streamlit.sh"
-$startInfo.UseShellExecute = $false
-$startInfo.RedirectStandardOutput = $true
-$startInfo.RedirectStandardError = $true
-$startInfo.WindowStyle = [System.Diagnostics.ProcessWindowStyle]::Hidden
-$startInfo.CreateNoWindow = $true
-# Start the process
-try {
-    $process = [System.Diagnostics.Process]::Start($startInfo)
-    Start-Sleep -Seconds 5
-    # Check if Streamlit is actually running
-    $streamlitRunning = Test-NetConnection -ComputerName localhost -Port 8501 -WarningAction SilentlyContinue
-    if ($streamlitRunning.TcpTestSucceeded) {
-        Start-Process "msedge.exe" "--app=http://localhost:8501"
-    } else {
-        Write-Error "Failed to start Streamlit application"
-    }
-} catch {
-    Write-Error "Error starting Streamlit: $_"
-}
 ```
-This script will:
-1. Start the Streamlit server if it's not already running
-2. Open the application in Microsoft Edge in app mode
-3. Automatically handle server startup and connection
 ## Dependencies
-- ipykernel >= 6.29.5
-- markdown >= 3.7
-- pyyaml >= 6.0.2
 - qdrant-client >= 1.12.1
 - fastembed >= 0.4.2
-- streamlit >= 1.41.1
 - pyperclip >= 1.9.0
 - regex >= 2024.11.6
 ## License
 This project is licensed under the MIT License.

 ---
 # Fabric to Espanso Converter
+A Python application that bridges Fabric prompts with Espanso and Obsidian Textgenerator by managing and converting prompts through a vector database. It enables semantic search and efficient management of prompts while providing a modern web interface for easy interaction.
+There's also a separate Gradio app that can be hosted on Hugging Face Spaces to provide a query-only interface.
 ## Features
+- **Vector Database Integration**: Store and manage Fabric prompts in a Qdrant vector database with semantic search capabilities
+- **Automated Conversion**: Convert stored prompts into Espanso YAML format for system-wide usage
+- **Change Detection**: Automatically detect and process changes in the Fabric patterns folder
+- **Web Interface**: Modern Gradio-based interface for easy prompt searching and management
+- **Semantic Search**: Find relevant prompts based on their meaning, not just exact matches
+- **Clipboard Integration**: Quick copying of prompts directly to clipboard
+- **Logging System**: Comprehensive logging for tracking operations and debugging
 ## Prerequisites
+- Python 3.11 or higher
+- Fabric (https://github.com/danielmiessler/fabric)
+- Qdrant vector database (local or cloud instance)
+- Obsidian with TextGenerator plugin (https://github.com/obsidianmd/obsidian-textgenerator)
+- Linux/WSL2 or Windows with WSL2
 ## Installation
+1. **Environment Setup**:
    ```bash
+   # Clone the repository
+   git clone [repository-url]
+   cd fabric_to_espanso
+   # Install PDM if not already installed
+   pip install pdm
+   # Install dependencies
    pdm install
    ```
+2. **Configuration**:
+   - Copy `.env.example` to `.env`
+   - Set your Qdrant API key in `.env`:
+     ```
+     QDRANT_API_KEY=your_api_key_here
+     ```
+3. **Obsidian Setup**:
+   - Install Obsidian and the TextGenerator plugin
+   - Create the folder structure:
+     ```
+     Extra/
+     └── FabricPatterns/
+         ├── Official/  # Official Fabric patterns
+         └── Own/       # Custom patterns
+     ```
+4. **Fabric Setup**:
+   - Install Fabric, see https://github.com/danielmiessler/fabric
+5. **Qdrant Setup**:
+   - Install Qdrant, see https://qdrant.io/en/
+   - Start Qdrant server
+6. **Parameters**:
+   - Set all the parameters in the file `parameters.py`.
+7. **Optional**:
+   - Create a PowerShell script to run the Streamlit app
 ## Usage
+### Starting the Application
+#### Linux/WSL2
 ```bash
+# Start the Gradio interface
+python gradio_app_query_only.py
 ```
+#### Windows (with WSL2)
 ```powershell
+# Use the provided PowerShell script
+./start_app.ps1
 ```
+### Core Operations
+1. **Search Prompts**:
+   - Enter your search query in the search box
+   - Results are ranked by semantic similarity
+   - Click on a result to view its contents
+2. **Copy Prompts**:
+   - Select a prompt from the results
+   - Click "Copy to Clipboard" to copy the prompt text
+3. **Update Database**:
+   - Run `python main.py` to process changes in the Fabric patterns folder
+   - New and modified prompts are automatically added to the database
+   - Deleted prompts are removed from the database
+## Project Structure
+```
+fabric_to_espanso/
+├── src/
+│   ├── fabrics_processor/    # Core processing logic
+│   └── search_qdrant/        # Search functionality
+├── gradio_app_query_only.py  # Web interface
+├── main.py                   # CLI entry point
+└── parameters.py            # Configuration parameters
+```
 ## Dependencies
+Core dependencies are managed through PDM:
+- gradio >= 5.12.0
 - qdrant-client >= 1.12.1
 - fastembed >= 0.4.2
+- python-dotenv
 - pyperclip >= 1.9.0
+- pyyaml >= 6.0.2
 - regex >= 2024.11.6
+## TODO
+The following items need to be addressed to improve code quality, maintainability, and functionality:
+### Database Optimization
+- Check the database for any points with exactly the same vector or nearly the same. Remove those to reduce redundancy and improve search efficiency.
+### Metadata Enhancement
+- If available, use the readme.md file from the fabrics folder to fill the "purpose" field in the database entries.
+- If readme.md is not available in the fabrics folder, create the "purpose" field from an LLM response that summarizes the goal of the fabric file.
+### UI/UX Improvements
+- Add a compare interface to the gradio app to allow side-by-side comparison of prompts.
+- Remove the streamlit_only_query app as it's being replaced by the gradio interface.
+### Code Refactoring
+- Implement proper error handling for database operations.
+- Add comprehensive logging throughout the application.
+- Create unit tests for core functionality.
+- Implement type hints consistently across all Python files.
+- Add input validation for all user-provided data.
+- Refactor the database operations into a dedicated class.
+- Implement connection pooling for better database performance.
+- Add docstrings to all functions and classes.
+- Create a configuration class to handle all settings.
+- Add proper cleanup of resources in error cases.
+### Documentation
+- Add API documentation for all public interfaces.
+- Include examples for common use cases.
+- Document the database schema and vector space organization.
+- Add contribution guidelines.
+- Include troubleshooting section.
+### Security
+- Implement proper environment variable handling.
+- Add input sanitization for all user inputs.
+- Implement rate limiting for the web interface.
+- Add proper authentication for the web interface.
+### Performance
+- Implement caching for frequently accessed prompts.
+- Optimize vector similarity search parameters.
+- Add batch processing for large-scale operations.
 ## License
 This project is licensed under the MIT License.
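
The README's "Change Detection" and "Update Database" items describe sorting pattern files into new, modified, and deleted sets. The code that does this is not part of this diff, so the following is only a minimal sketch of one possible approach, assuming a content hash is stored per filename. `detect_changes`, `content_hash`, and both dictionaries are hypothetical names.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a stable SHA-256 hex digest of a file's text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_changes(
    stored: dict[str, str], current: dict[str, str]
) -> tuple[list[str], list[str], list[str]]:
    """Compare stored hashes with current file contents.

    stored:  filename -> hash recorded in the database (assumed layout)
    current: filename -> text currently on disk
    Returns (new, modified, deleted) filename lists, each sorted.
    """
    current_hashes = {name: content_hash(text) for name, text in current.items()}
    new = sorted(n for n in current_hashes if n not in stored)
    deleted = sorted(n for n in stored if n not in current_hashes)
    modified = sorted(
        n for n, h in current_hashes.items() if n in stored and stored[n] != h
    )
    return new, modified, deleted


# Hypothetical example data: one unchanged, one edited, one removed, one added file
stored = {
    "summarize.md": content_hash("old text"),
    "extract_wisdom.md": content_hash("same text"),
    "obsolete.md": content_hash("gone"),
}
current = {
    "summarize.md": "new text",
    "extract_wisdom.md": "same text",
    "write_essay.md": "fresh pattern",
}
new_files, modified_files, deleted_files = detect_changes(stored, current)
```

The three lists correspond to the `new_files`, `modified_files`, and `deleted_files` arguments that `update_qdrant_database` takes in this commit.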

data/Fab2Esp_transparent.png

Git LFS Details

SHA256: 2830b7e02e6a798c3a95eeb1f0cb0f68bab3901e15287270eb7288a50b83f8e6
Pointer size: 131 Bytes
Size of remote file: 779 kB

parameters.py

@@ -47,7 +47,7 @@ BASE_WORDS = ['Identity', 'Purpose', 'Task', 'Goal']
 # COLLECTION_NAME = "fabric_patterns"
 # Cloud:
 QDRANT_URL = "https://91ed3a93-6135-4951-a624-1c8c2878240d.europe-west3-0.gcp.cloud.qdrant.io:6333"
-COLLECTION_NAME = "fabric_patterns"
 # Required fields for database points
 # TODO: the default trigger is currently defined twice; fix this
@@ -61,4 +61,5 @@ REQUIRED_FIELDS_DEFAULTS = {
 # Embedding model parameters for Qdrant
 USE_FASTEMBED = True
-EMBED_MODEL = "fast-bge-small-en"

 # COLLECTION_NAME = "fabric_patterns"
 # Cloud:
 QDRANT_URL = "https://91ed3a93-6135-4951-a624-1c8c2878240d.europe-west3-0.gcp.cloud.qdrant.io:6333"
+COLLECTION_NAME = "fabric_patterns_hybrid"
 # Required fields for database points
 # TODO: the default trigger is currently defined twice; fix this
 # Embedding model parameters for Qdrant
 USE_FASTEMBED = True
+EMBED_MODEL_DENSE = 'BAAI/bge-base-en' # "fast-bge-small-en"
+EMBED_MODEL_SPARSE = "prithivida/Splade_PP_en_v1"
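
The two model names set here line up with the vector names hard-coded in `database_updater.py` in this commit (`'fast-bge-base-en'` and `'fast-sparse-splade_pp_en_v1'`). The sketch below reproduces that naming pattern as inferred from those strings. `fastembed_field_names` is a hypothetical helper, not part of the repository or of qdrant-client. In real code, asking the client for the name, as the old comment suggests with `client.get_vector_field_name()`, is likely the safer route.

```python
def fastembed_field_names(dense_model: str, sparse_model: str) -> tuple[str, str]:
    """Derive named-vector keys from model names.

    Assumption: the key is 'fast-' (or 'fast-sparse-') followed by the
    lower-cased last path segment of the model name, as the hard-coded
    strings in this commit suggest.
    """
    dense_key = "fast-" + dense_model.split("/")[-1].lower()
    sparse_key = "fast-sparse-" + sparse_model.split("/")[-1].lower()
    return dense_key, sparse_key


# Values copied from parameters.py in this commit
EMBED_MODEL_DENSE = "BAAI/bge-base-en"
EMBED_MODEL_SPARSE = "prithivida/Splade_PP_en_v1"

dense_key, sparse_key = fastembed_field_names(EMBED_MODEL_DENSE, EMBED_MODEL_SPARSE)
```

Deriving the keys from the configured model names means a later model swap in `parameters.py` does not leave stale vector names behind.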

src/fabrics_processor/config.py

@@ -16,7 +16,8 @@ from parameters import (
     BASE_WORDS,
     QDRANT_URL,
     USE_FASTEMBED,
-    EMBED_MODEL,
     COLLECTION_NAME,
     REQUIRED_FIELDS,
     REQUIRED_FIELDS_DEFAULTS
@@ -61,22 +62,22 @@ class DatabaseConfig:
             raise ConfigurationError(str(e))
 @dataclass
-class EmbeddingConfig:
     """Embedding model configuration."""
     use_fastembed: bool = USE_FASTEMBED
-    model_name: str = EMBED_MODEL
     collection_name: str = COLLECTION_NAME
-    vector_size: int = 384
     def validate(self) -> None:
         """Validate the embedding configuration."""
-        if not self.model_name:
             from .exceptions import ConfigurationError
-            raise ConfigurationError("Embedding model name cannot be empty")
-        if self.vector_size <= 0:
             from .exceptions import ConfigurationError
-            raise ConfigurationError(f"Vector size must be > 0, got {self.vector_size}")
 class Config:
     """Global configuration singleton."""
@@ -86,7 +87,7 @@ class Config:
         if cls._instance is None:
             cls._instance = super().__new__(cls)
             cls._instance.database = DatabaseConfig()
-            cls._instance.embedding = EmbeddingConfig()
             cls._instance.espanso_trigger = DEFAULT_TRIGGER
             cls._instance.fabric_patterns_folder = FABRIC_PATTERNS_FOLDER
             cls._instance.yaml_output_folder = YAML_OUTPUT_FOLDER

     BASE_WORDS,
     QDRANT_URL,
     USE_FASTEMBED,
+    EMBED_MODEL_DENSE,
+    EMBED_MODEL_SPARSE,
     COLLECTION_NAME,
     REQUIRED_FIELDS,
     REQUIRED_FIELDS_DEFAULTS
             raise ConfigurationError(str(e))
 @dataclass
+class EmbeddingModelConfig:
     """Embedding model configuration."""
     use_fastembed: bool = USE_FASTEMBED
     collection_name: str = COLLECTION_NAME
+    dense_model_name: str = EMBED_MODEL_DENSE
+    sparse_model_name: str = EMBED_MODEL_SPARSE
     def validate(self) -> None:
         """Validate the embedding configuration."""
+        if not self.dense_model_name:
             from .exceptions import ConfigurationError
+            raise ConfigurationError("Dense Embedding model name cannot be empty")
+        if not self.sparse_model_name:
             from .exceptions import ConfigurationError
+            raise ConfigurationError("Sparse Embedding model name cannot be empty")
 class Config:
     """Global configuration singleton."""
         if cls._instance is None:
             cls._instance = super().__new__(cls)
             cls._instance.database = DatabaseConfig()
+            cls._instance.embedding = EmbeddingModelConfig()
             cls._instance.espanso_trigger = DEFAULT_TRIGGER
             cls._instance.fabric_patterns_folder = FABRIC_PATTERNS_FOLDER
             cls._instance.yaml_output_folder = YAML_OUTPUT_FOLDER
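
To illustrate how the reworked `validate()` behaves, here is a self-contained sketch of the new dataclass with a small usage wrapper. `ConfigurationError` is a local stand-in for the project's own exception, which this diff does not show. The default values are copied from `parameters.py`, and `is_valid` is a hypothetical helper.

```python
from dataclasses import dataclass


class ConfigurationError(Exception):
    """Stand-in for the project's exception class (assumed, not shown in the diff)."""


@dataclass
class EmbeddingModelConfig:
    """Embedding model configuration with dense and sparse model names."""
    use_fastembed: bool = True
    collection_name: str = "fabric_patterns_hybrid"
    dense_model_name: str = "BAAI/bge-base-en"
    sparse_model_name: str = "prithivida/Splade_PP_en_v1"

    def validate(self) -> None:
        """Raise ConfigurationError if either model name is empty."""
        if not self.dense_model_name:
            raise ConfigurationError("Dense embedding model name cannot be empty")
        if not self.sparse_model_name:
            raise ConfigurationError("Sparse embedding model name cannot be empty")


def is_valid(cfg: EmbeddingModelConfig) -> bool:
    """Hypothetical helper: report validity instead of raising."""
    try:
        cfg.validate()
    except ConfigurationError:
        return False
    return True


default_ok = is_valid(EmbeddingModelConfig())
missing_sparse_ok = is_valid(EmbeddingModelConfig(sparse_model_name=""))
```

With the `vector_size` field gone, an empty model name is the only condition `validate()` still rejects.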

src/fabrics_processor/database.py

@@ -52,7 +52,8 @@ def initialize_qdrant_database(
     api_key: Optional[str] = "",
     collection_name: str = config.embedding.collection_name,
     use_fastembed: bool = config.embedding.use_fastembed,
-    embed_model: str = config.embedding.model_name
 ) -> QdrantClient:
     """Initialize the Qdrant database for storing markdown file information.
@@ -75,6 +76,9 @@ def initialize_qdrant_database(
         # Create database connection
         client = create_database_connection(url=url, api_key=api_key)
         # Check if collection exists
         collections = client.get_collections()
@@ -85,19 +89,17 @@ def initialize_qdrant_database(
             # Create collection with appropriate vector configuration
             if use_fastembed:
-                vector_config = client.get_fastembed_vector_params()
             else:
-                vector_config = {
-                    embed_model: VectorParams(
-                        size=config.embedding.vector_size,
-                        distance=Distance.COSINE
-                    )
-                }
             try:
                 client.create_collection(
                     collection_name=collection_name,
-                    vectors_config=vector_config,
                     on_disk_payload=True
                 )
             except exceptions.UnexpectedResponse as e:

     api_key: Optional[str] = "",
     collection_name: str = config.embedding.collection_name,
     use_fastembed: bool = config.embedding.use_fastembed,
+    dense_model: str = config.embedding.dense_model_name,
+    sparse_model: str = config.embedding.sparse_model_name
 ) -> QdrantClient:
     """Initialize the Qdrant database for storing markdown file information.
         # Create database connection
         client = create_database_connection(url=url, api_key=api_key)
+        client.set_model(dense_model)
+        client.set_sparse_model(sparse_model)
         # Check if collection exists
         collections = client.get_collections()
             # Create collection with appropriate vector configuration
             if use_fastembed:
+                vectors_config = client.get_fastembed_vector_params()
+                sparse_vectors_config = client.get_fastembed_sparse_vector_params()
             else:
+                print("Creating database without Fastembed not implemented yet.")
+                raise NotImplementedError()
             try:
                 client.create_collection(
                     collection_name=collection_name,
+                    vectors_config=vectors_config,
+                    sparse_vectors_config=sparse_vectors_config,
                     on_disk_payload=True
                 )
             except exceptions.UnexpectedResponse as e:

src/fabrics_processor/database_updater.py

@@ -1,7 +1,7 @@
 from typing import Optional
 from qdrant_client import QdrantClient
 from qdrant_client.http.models import PointStruct, Filter, FieldCondition, MatchValue, PointIdsList
-from fastembed import TextEmbedding
 import logging
 import uuid
 from .output_files_generator import generate_yaml_file, generate_markdown_files
@@ -11,7 +11,7 @@ from .database import validate_point_payload
 logger = logging.getLogger('fabric_to_espanso')
-def get_embedding(text: str, embedding_model: TextEmbedding) -> list:
     """
     Generate embedding vector for the given text using FastEmbed.
@@ -19,10 +19,25 @@ def get_embedding(text: str, embedding_model: TextEmbedding) -> list:
         text (str): Text to generate embedding for
     Returns:
-        list: Embedding vector
     """
-    embeddings = list(embedding_model.embed([text]))
-    return embeddings[0].tolist()
 def update_qdrant_database(client: QdrantClient, collection_name: str, new_files: list, modified_files: list, deleted_files: list):
     """
@@ -34,16 +49,10 @@ def update_qdrant_database(client: QdrantClient, collection_name: str, new_files
         modified_files (list): List of modified files to be updated in the database.
         deleted_files (list): List of deleted files to be removed from the database.
     """
-    # Initialize the FastEmbed model (done once)
-    if config.embedding.use_fastembed:
-        # TODO: I think it is possible to choose another model here. Make that an option
-        logger.info(f"Initializing FastEmbed model.")
-        embedding_model = TextEmbedding()
-    else:
-        logger.info(f"Initializing embedding model: {config.model_name}")
-        # TODO: test this. Not sure whether it works.
-        embedding_model = TextEmbedding(model_name=config.model_name)
     try:
         # Add new files
@@ -52,9 +61,10 @@ def update_qdrant_database(client: QdrantClient, collection_name: str, new_files
                 payload_new = validate_point_payload(file)
                 point = PointStruct(
                     id=str(uuid.uuid4()),  # Generate a new UUID for each point
-                # TODO: 'fast-bge-small-en' is the name of the vector. You can find the name with: client.get_vector_field_name()
-                    vector={'fast-bge-small-en':
-                            get_embedding(payload_new['purpose'], embedding_model)},  # Generate vector from purpose field
                     payload={
                         "filename": payload_new['filename'],
                         "content": payload_new['content'],
@@ -87,15 +97,10 @@ def update_qdrant_database(client: QdrantClient, collection_name: str, new_files
                     # Update the existing point with the new file data
                     point = PointStruct(
                         id=point_id,
-                        # NOTE: when you use 'fastembed', you must use the name of the vector.
-                        # In this case the name is 'fast-bge-small-en'.
-                        # If you do not use fastembed but call the Qdrant API directly, you can also use
-                        # unnamed vectors and write vector = get_embedding(file['purpose'], embedding_model)
-                        # See https://github.com/qdrant/qdrant-client/discussions/598
-                        # The name fastembed uses depends on the model you use.
-                        # You can find the name with: client.get_vector_field_name()
-                        vector={'fast-bge-small-en':
-                            get_embedding(file['purpose'], embedding_model)},  # Generate vector from purpose field
                         payload={
                         "filename": payload_current['filename'],
                         "content": file['content'],

 from typing import Optional
 from qdrant_client import QdrantClient
 from qdrant_client.http.models import PointStruct, Filter, FieldCondition, MatchValue, PointIdsList
+from fastembed import TextEmbedding, SparseTextEmbedding
 import logging
 import uuid
 from .output_files_generator import generate_yaml_file, generate_markdown_files
 logger = logging.getLogger('fabric_to_espanso')
+def get_embedding(text: str) -> tuple:
     """
     Generate embedding vector for the given text using FastEmbed.
         text (str): Text to generate embedding for
     Returns:
+        tuple: (dense_embedding, sparse_embedding), where the sparse embedding is a dict with 'indices' and 'values' lists
     """
+    if not config.embedding.use_fastembed:
+        msg = "Embedding model not initialized. Set use_fastembed to True in the configuration."
+        logger.error(msg)
+        raise ConfigurationError(msg)
+    # Models are lazily initialized only when needed
+    if not hasattr(get_embedding, '_dense_model'):
+        get_embedding._dense_model = TextEmbedding(model_name=config.embedding.dense_model_name)
+    if not hasattr(get_embedding, '_sparse_model'):
+        get_embedding._sparse_model = SparseTextEmbedding(model_name=config.embedding.sparse_model_name)
+    dense_embeddings = list(get_embedding._dense_model.embed(text))[0]
+    sparse_embedding = list(get_embedding._sparse_model.embed(text, return_dense=False))[0]
+    return dense_embeddings, {
+        'indices': sparse_embedding.indices.tolist(),
+        'values': sparse_embedding.values.tolist()
+    }
 def update_qdrant_database(client: QdrantClient, collection_name: str, new_files: list, modified_files: list, deleted_files: list):
     """
         modified_files (list): List of modified files to be updated in the database.
         deleted_files (list): List of deleted files to be removed from the database.
     """
+    if not config.embedding.use_fastembed:
+        msg = "Embedding model not initialized. Set use_fastembed to True in the configuration."
+        logger.info(msg)
+        return
     try:
         # Add new files
                 payload_new = validate_point_payload(file)
                 point = PointStruct(
                     id=str(uuid.uuid4()),  # Generate a new UUID for each point
+                    vector={
+                        'fast-bge-base-en': get_embedding(payload_new['purpose'])[0],
+                        'fast-sparse-splade_pp_en_v1': get_embedding(payload_new['purpose'])[1]
+                    },
                     payload={
                         "filename": payload_new['filename'],
                         "content": payload_new['content'],
                     # Update the existing point with the new file data
                     point = PointStruct(
                         id=point_id,
+                        vector={
+                            'fast-bge-base-en': get_embedding(payload_current['purpose'])[0],
+                            'fast-sparse-splade_pp_en_v1': get_embedding(payload_current['purpose'])[1]
+                        },
                         payload={
                         "filename": payload_current['filename'],
                         "content": file['content'],
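
Both `PointStruct` constructions in this file call `get_embedding` twice on the same text, once per vector name, so each point is embedded twice. The sketch below shows one way to embed once and reuse both parts. `build_named_vectors` and `fake_embed` are hypothetical names, and the stub stands in for the real FastEmbed models so the example runs without downloading anything.

```python
from typing import Callable


def build_named_vectors(
    text: str,
    embed: Callable[[str], tuple],
    dense_name: str = "fast-bge-base-en",
    sparse_name: str = "fast-sparse-splade_pp_en_v1",
) -> dict:
    """Embed `text` once and return a named-vector dict for a point."""
    dense, sparse = embed(text)  # single call; both parts come from it
    # Convert array-like dense output (e.g. a NumPy array) to a plain list
    dense_list = dense.tolist() if hasattr(dense, "tolist") else list(dense)
    return {dense_name: dense_list, sparse_name: sparse}


calls: list[str] = []


def fake_embed(text: str) -> tuple:
    """Stub embedder: records each call and returns fixed example vectors."""
    calls.append(text)
    return [0.1, 0.2, 0.3], {"indices": [4, 17], "values": [0.5, 0.25]}


vectors = build_named_vectors("Summarize a research paper", fake_embed)
```

The resulting dict has the same shape as the `vector=` argument used above, but the embedder runs once per point instead of twice.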

src/search_qdrant/streamlit_app.py

@@ -1,4 +1,5 @@
 import streamlit as st
 import pyperclip
 from pathlib import Path
 from src.fabrics_processor.database import initialize_qdrant_database
@@ -155,15 +156,14 @@ def update_database():
                 fabric_patterns_folder=config.fabric_patterns_folder
             )
-            # Update the database if changes are detected
-            if any([new_files, modified_files, deleted_files]):
-                update_qdrant_database(
-                    client=st.session_state.client,
-                    collection_name=config.embedding.collection_name,
-                    new_files=new_files,
-                    modified_files=modified_files,
-                    deleted_files=deleted_files
-                )
             # Get updated collection info
             collection_info = st.session_state.client.get_collection(config.embedding.collection_name)

 import streamlit as st
+import os
 import pyperclip
 from pathlib import Path
 from src.fabrics_processor.database import initialize_qdrant_database
                 fabric_patterns_folder=config.fabric_patterns_folder
             )
+            # Update the database
+            update_qdrant_database(
+                client=st.session_state.client,
+                collection_name=config.embedding.collection_name,
+                new_files=new_files,
+                modified_files=modified_files,
+                deleted_files=deleted_files
+            )
             # Get updated collection info
             collection_info = st.session_state.client.get_collection(config.embedding.collection_name)