diff --git a/graphrag_construct.html b/graphrag_construct.html new file mode 100644 index 0000000000000000000000000000000000000000..90509da9aa7b1e24521b1518f14bc149e0a0270e --- /dev/null +++ b/graphrag_construct.html @@ -0,0 +1,444 @@ + + + + + + + + + + + + + + + +
+

+
+ + + + + + +
+

+
+ + + + + +
+ + +
+
+ + +
+
+
0%
+
+
+
+
+
+ + + + + + \ No newline at end of file diff --git a/graphrag_demo.html b/graphrag_demo.html new file mode 100644 index 0000000000000000000000000000000000000000..a4741e84738c6db2f33da452e2d6b831751f5057 --- /dev/null +++ b/graphrag_demo.html @@ -0,0 +1,459 @@ + + + + + + + + + + + + + + + +
+

+
+ + + + + + +
+

+
+ + + + + +
+ + +
+
+ + +
+
+
0%
+
+
+
+
+
+ + +
+ + + + + \ No newline at end of file diff --git a/graphrag_readme.md b/graphrag_readme.md new file mode 100644 index 0000000000000000000000000000000000000000..8bd0a030339a2f1a90a2e3570c51be7e48c11aed --- /dev/null +++ b/graphrag_readme.md @@ -0,0 +1,351 @@ +# GraphRAG README + +## Some fundamental concepts + +### Data Ingestion + +NOTE: mermaid.js diagrams below are based on some inspiring content from the [Connected Data London 2024: Entity Resolved Knowledge Graphs](https://github.com/DerwenAI/cdl2024_masterclass/blob/main/README.md) masterclass. + +```mermaid +graph TD + %% Database shapes with consistent styling + SDS[(Structured
Data Sources)] + UDS[(Unstructured
Data Sources)] + LG[(lexical graph)] + SG[(semantic graph)] + VD[(vector database)] + + %% Flow from structured data + SDS -->|PII features| ER[entity resolution] + SDS -->|data records| SG + SG -->|PII updates| ER + ER -->|semantic overlay| SG + + %% Schema and ontology + ONT[schema, ontology, taxonomy,
controlled vocabularies, etc.] + ONT --> SG + + %% Flow from unstructured data + UDS --> K[text chunking
function] + K --> NLP[NLP parse] + K --> EM[embedding model] + NLP --> E[NER, RE] + E --> LG + LG --> EL[entity linking] + EL <--> SG + + %% Vector elements connections + EM --> VD + VD -.->|capture source chunk
WITHIN references| SG + + %% Thesaurus connection + ER -.->T[thesaurus] + T --> EL + + %% Styling classes + classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px; + classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px; + classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px; + classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px; + classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px; + classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px; + + %% Apply styles by layer/type + class SDS,UDS dataSource; + class SG,VD storage; + class EM embedding; + class LG lexical; + class SG semantic; + class ONT,T reference; +``` + +### Augment LLM Inference + +```mermaid +graph LR + %% Define database and special shapes + P[prompt] + SG[(semantic graph)] + VD[(vector database)] + LLM[LLM] + Z[response] + + %% Main flow paths + P --> Q[generated query] + P --> EM[embedding model] + + %% Upper path through graph elements + Q --> SG + SG --> W[semantic
random walk] + T[thesaurus] --> W + W --> GA[graph analytics] + + %% Lower path through vector elements + EM --> SS[vector
similarity search] + SS --> VD + + %% Node embeddings and chunk references + SG -.-|chunk references| VD + SS -->|node embeddings| SG + + %% Final convergence + GA --> RI[ranked index] + VD --> RI + RI --> LLM + LLM --> Z + + %% Styling classes + classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px; + classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px; + classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px; + classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px; + classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px; + classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px; + + %% Apply styles by layer/type (only nodes present in this diagram) + class VD storage; + class EM embedding; + class SG semantic; + class T reference; +``` + +## Sequence Diagram - covering the current `strwythura` (structure) repo + +- the diagram below is largely based on the `demo.py` functions +- I used [Prefect](https://www.prefect.io/) to dig in and reverse-engineer the flow... + - [graphrag_demo.py](./graphrag_demo.py) is my simple update to [Paco's original Python code](./demo.py) + - I stuck to using Prefect function decorators based on the existing structure, but I'm looking forward to abstracting some of the concepts out further and thinking agentically. +- Telemetry and instrumentation can often demystify complex processes without the headaches of wading through long print statements. Real insight often comes from seeing how individual functions and components interact. + - this repo features a large and distinguished cast of open source models (GLiNER, GLiREL), open source embeddings (BGE, Word2Vec), and a vector store (LanceDB) for improved entity recognition and relationship extraction. 
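The retrieval role the vector store plays can be illustrated with a minimal cosine-similarity sketch (toy NumPy vectors for illustration only — this is not the LanceDB API):

```python
import numpy as np

def cosine_top_k(query_vec, chunk_vecs, k=2):
    """Rank stored chunk embeddings by cosine similarity to a query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = m @ q
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top]

# toy 4-dimensional "embeddings" for three text chunks
chunks = np.array([[1.0, 0.00, 0.0, 0.0],
                   [0.9, 0.10, 0.0, 0.0],
                   [0.0, 0.00, 1.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])
best = cosine_top_k(query, chunks)  # chunk 0 ranks first, then chunk 1
```

In the actual pipeline the embeddings come from the BGE model and the lookup happens inside LanceDB, but the ranking math is the same idea.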
+- For a deeper dive, [Paco's YouTube video and associated diagrams](https://senzing.com/gph-graph-rag-llm-knowledge-graphs/) help highlight real-world use cases where effective Knowledge Graph construction can provide deeper meaning and insight. + + +```mermaid +sequenceDiagram + participant Main as Main Script + participant ConstructKG as construct_kg Flow + participant InitNLP as init_nlp Task + participant ScrapeHTML as scrape_html Task + participant MakeChunk as make_chunk Task + participant ParseText as parse_text Task + participant MakeEntity as make_entity Task + participant ExtractEntity as extract_entity Task + participant ExtractRelations as extract_relations Task + participant ConnectEntities as connect_entities Task + participant RunTextRank as run_textrank Task + participant AbstractOverlay as abstract_overlay Task + participant GenPyvis as gen_pyvis Task + + Main->>ConstructKG: Start construct_kg flow + ConstructKG->>InitNLP: Initialize NLP pipeline + InitNLP-->>ConstructKG: Return NLP object + + loop For each URL in url_list + ConstructKG->>ScrapeHTML: Scrape HTML content + ScrapeHTML->>MakeChunk: Create text chunks + MakeChunk-->>ScrapeHTML: Return chunk list + ScrapeHTML-->>ConstructKG: Return chunk list + + loop For each chunk in chunk_list + ConstructKG->>ParseText: Parse text and build lex_graph + ParseText->>MakeEntity: Create entities from spans + MakeEntity-->>ParseText: Return entity + ParseText->>ExtractEntity: Extract and add entities to lex_graph + ExtractEntity-->>ParseText: Entity added to graph + ParseText->>ExtractRelations: Extract relations between entities + ExtractRelations-->>ParseText: Relations added to graph + ParseText->>ConnectEntities: Connect co-occurring entities + ConnectEntities-->>ParseText: Connections added to graph + ParseText-->>ConstructKG: Return parsed doc + end + + ConstructKG->>RunTextRank: Run TextRank on lex_graph + RunTextRank-->>ConstructKG: Return ranked entities + ConstructKG->>AbstractOverlay: Overlay 
semantic graph + AbstractOverlay-->>ConstructKG: Overlay completed + end + + ConstructKG->>GenPyvis: Generate Pyvis visualization + GenPyvis-->>ConstructKG: Visualization saved + ConstructKG-->>Main: Flow completed +``` + +## Run the code + +1. Set up a local Python environment and install the Python dependencies + + - I used Python 3.11, but 3.10 should work as well + + ```bash + pip install -r requirements.txt + ``` + +2. Start the local Prefect server + + - follow the [self-hosted instructions](https://docs.prefect.io/v3/get-started/quickstart#connect-to-a-prefect-api) to launch the `Prefect UI` + + ```bash + prefect server start + ``` + +3. Run the `graphrag_demo.py` script + + ```bash + python graphrag_demo.py + ``` + +## Appendix: Code Overview and Purpose + +- The code forms part of a talk for **GraphGeeks.org** about constructing **knowledge graphs** from **unstructured data sources**, such as web content. +- It integrates web scraping, natural language processing (NLP), graph construction, and interactive visualization. + +--- + +### **Key Components and Flow** + +#### **1. Model and Parameter Settings** +- **Core Configuration**: Establishes the foundational settings like chunk size, embedding models (`BAAI/bge-small-en-v1.5`), and database URIs. +- **NER Labels**: Defines entity categories such as `Person`, `Organization`, `Publication`, and `Technology`. +- **Relation Types**: Configures relationships like `works_at`, `developed_by`, and `authored_by` for connecting entities. +- **Scraping Parameters**: Sets user-agent headers for web requests. + +#### **2. Data Validation** +- **Classes**: + - `TextChunk`: Represents segmented text chunks with their embeddings. + - `Entity`: Tracks extracted entities, their attributes, and relationships. +- **Purpose**: Ensures data is clean and well-structured for downstream processing. + +#### **3. Data Collection** +- **Functions**: + - `scrape_html`: Fetches and parses webpage content. 
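The parsing half of the `scrape_html` step can be sketched roughly as follows (a hedged illustration using BeautifulSoup on an inline HTML string; the repo's actual function also handles the HTTP request, headers, and downstream chunking):

```python
from bs4 import BeautifulSoup

def extract_visible_text(html: str) -> str:
    """Parse HTML and return whitespace-normalized visible text."""
    soup = BeautifulSoup(html, "html.parser")
    # drop elements that carry no prose
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(" ", strip=True)

sample = "<html><body><h1>GraphRAG</h1><p>Entities and relations.</p><script>var x=1;</script></body></html>"
text = extract_visible_text(sample)
```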
+ - `uni_scrubber`: Cleans Unicode and formatting issues. + - `make_chunk`: Segments long text into manageable chunks for embedding. +- **Role**: Prepares raw, unstructured data for structured analysis. + +#### **4. Lexical Graph Construction** +- **Initialization**: + - `init_nlp`: Sets up NLP pipelines with spaCy, GLiNER (NER), and GLiREL (RE). +- **Graph Parsing**: + - `parse_text`: Creates lexical graphs using TextRank algorithms. + - `make_entity`: Extracts and integrates entities into the graph. + - `connect_entities`: Links entities co-occurring in the same context. +- **Purpose**: Converts text into a structured, connected graph of entities and relationships. + +#### **5. Numerical Processing** +- **Functions**: + - `calc_quantile_bins`: Creates quantile bins for numerical data. + - `root_mean_square`: Computes RMS for normalization. + - `stripe_column`: Applies quantile binning to data columns. +- **Role**: Provides statistical operations to refine and rank graph components. + +#### **6. TextRank Implementation** +- **Functions**: + - `run_textrank`: Ranks entities in the graph based on a PageRank-inspired algorithm. +- **Purpose**: Identifies and prioritizes key entities for knowledge graph construction. + +#### **7. Semantic Overlay** +- **Functions**: + - `abstract_overlay`: Abstracts a semantic layer from the lexical graph. + - Connects entities to their originating text chunks for context preservation. +- **Role**: Enhances the graph with higher-order relationships and semantic depth. + +#### **8. Visualization** +- **Tool**: `pyvis` +- **Functions**: + - `gen_pyvis`: Creates an interactive visualization of the knowledge graph. +- **Features**: + - Node sizing reflects entity importance. + - Physics-based layout supports intuitive exploration. + +#### **9. Orchestration** +- **Function**: + - `construct_kg`: Orchestrates the full pipeline from data collection to visualization. 
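At a high level, `construct_kg` chains the stages described above: scrape, chunk, extract entities, connect co-occurring entities. A toy stand-in makes the shape of that orchestration concrete (illustrative names and logic only — not the repo's actual Prefect tasks):

```python
# Stand-ins for the real tasks (illustrative only)
def scrape(url):
    # the real flow fetches and scrubs HTML here
    return "GLiNER extracts entities. GLiREL extracts relations between entities."

def make_chunks(text, size=40):
    # naive fixed-width chunking; the repo chunks by token count
    return [text[i:i + size] for i in range(0, len(text), size)]

def extract_entities(chunk):
    # the real code runs GLiNER; here we just match known names
    known = ("GLiNER", "GLiREL")
    return [k for k in known if k in chunk]

def construct_kg(urls):
    """Skeleton of the orchestration: scrape -> chunk -> extract -> connect."""
    graph = {}  # entity -> set of co-occurring entities
    for url in urls:
        for chunk in make_chunks(scrape(url)):
            ents = extract_entities(chunk)
            for e in ents:
                graph.setdefault(e, set()).update(x for x in ents if x != e)
    return graph

kg = construct_kg(["https://example.com"])
```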
+- **Purpose**: Ensures the seamless integration of all layers and components. + +--- + +### **Notable Implementation Details** + +- **Multi-Layer Graph Representation**: Combines lexical and semantic graphs for layered analysis. +- **Vector Embedding Integration**: Enhances entity representation with embeddings. +- **Error Handling and Debugging**: Includes robust logging and debugging features. +- **Scalability**: Designed for handling diverse and large datasets with dynamic relationships. + +--- + +## Appendix: Architectural Workflow + +### **1. Architectural Workflow: A Layered Approach to Knowledge Graph Construction** + +#### **1.1 Workflow Layers** + +**Data Ingestion:** +- Role: Extract raw data from structured and unstructured sources for downstream processing. +- Responsibilities: Handle diverse data formats, ensure quality, and standardize for analysis. +- Requirements: Reliable scraping, parsing, and chunking mechanisms to prepare data for embedding and analysis. + +**Lexical Graph Construction:** +- Role: Build a foundational graph by integrating tokenized data and semantic relationships. +- Responsibilities: Identify key entities through tokenization and ranking (e.g., TextRank). +- Requirements: Efficient methods for integrating named entities and relationships into a coherent graph structure. + +**Entity and Relation Extraction:** +- Role: Identify and label entities, along with their relationships, to enrich the graph structure. +- Responsibilities: Extract domain-specific entities (NER) and relationships (RE) to add connectivity. +- Requirements: Domain-tuned models and algorithms for accurate extraction. + +**Graph Construction and Visualization:** +- Role: Develop and display the knowledge graph to facilitate analysis and decision-making. +- Responsibilities: Create a graph structure using tools like NetworkX and enable exploration with interactive visualizations (e.g., PyVis). 
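A minimal sketch of this layer with NetworkX (toy entity graph; PageRank stands in here for the TextRank-style ranking, and a graph like `g` is the kind of object PyVis then renders interactively, e.g. via `Network().from_nx(g)`):

```python
import networkx as nx

# toy co-occurrence graph of extracted entities
g = nx.Graph()
g.add_edges_from([
    ("GraphRAG", "knowledge graph"),
    ("GraphRAG", "LLM"),
    ("GraphRAG", "vector database"),
    ("knowledge graph", "entity resolution"),
])

# rank entities by centrality, PageRank standing in for TextRank
ranks = nx.pagerank(g)
top_entity = max(ranks, key=ranks.get)  # "GraphRAG" is the hub here
```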
+- Requirements: Scalable graph-building frameworks and intuitive visualization tools. + +**Semantic Overlay:** +- Role: Enhance the graph with additional context and reasoning capabilities. +- Responsibilities: Integrate ontologies, taxonomies, and domain-specific knowledge to provide depth and precision. +- Requirements: Mechanisms to map structured data into graph elements and ensure consistency with existing knowledge bases. + + +### **2. Visualized Workflow** + +#### **2.1 Logical Data Flow** + +```mermaid +graph TD +A[Raw Data] -->|Scrape| B[Chunks] +B -->|Lexical Parsing| C[Lexical Graph] +C -->|NER + RE| D[Entities and Relations] +D -->|Construct KG| E[Knowledge Graph] +E -->|Overlay Ontologies| F[Enriched Graph] +F -->|Visualize| G[Interactive View] +``` + +--- + +### **3. Glossary** + +| **Participant** | **Description** | **Workflow Layer** | +|--------------------------------|---------------------------------------------------------------------------------------------------|-------------------------------------| +| **HTML Scraper (BeautifulSoup)** | Fetches unstructured text data from web sources. | Data Ingestion | +| **Text Chunker** | Breaks raw text into manageable chunks (e.g., 1024 tokens) and prepares them for embedding. | Data Ingestion | +| **SpaCy Pipeline** | Processes chunks and integrates GLiNER and GLiREL for entity and relation extraction. | Entity and Relation Extraction | +| **Embedding Model (bge-small-en-v1.5)** | Captures lower-level lexical meanings of text and translates them into machine-readable vector representations. | Data Ingestion | +| **GLiNER** | Identifies domain-specific entities and returns labeled outputs. | Entity and Relation Extraction | +| **GLiREL** | Extracts relationships between identified entities, adding connectivity to the graph. | Entity and Relation Extraction | +| **Vector Database (LanceDB)** | Stores chunk embeddings for efficient querying in downstream tasks. 
| Data Ingestion | +| **Word2Vec (Gensim)** | Generates entity embeddings based on graph co-occurrence for additional analysis. | Semantic Overlay | +| **Graph Constructor (NetworkX)** | Builds and analyzes the knowledge graph, ranking entities using TextRank. | Graph Construction and Visualization | +| **Graph Visualizer (PyVis)** | Provides an interactive visualization of the knowledge graph for interpretability. | Graph Construction and Visualization | + +## Citations: giving credit where credit is due... + +Inspired by the great work of the individuals who created the [Connected Data London 2024: Entity Resolved Knowledge Graphs](https://github.com/donbr/cdl2024_masterclass/blob/main/README.md) masterclass, I created this document to highlight the areas that rang true. + +- Paco Nathan https://senzing.com/consult-entity-resolution-paco/ +- Clair Sullivan https://clairsullivan.com/ +- Louis Guitton https://guitton.co/ +- Jeff Butcher https://github.com/jbutcher21 +- Michael Dockter https://github.com/docktermj + +The code to use GLiNER and GLiREL started as a fork of one of the four repos that make up the masterclass. diff --git a/gtp_aurk_inhibitors.html b/gtp_aurk_inhibitors.html new file mode 100644 index 0000000000000000000000000000000000000000..d85a481687d7ea344f73a848880f4c27f1664eef --- /dev/null +++ b/gtp_aurk_inhibitors.html @@ -0,0 +1,222 @@ + + + + + + + + + +
+

+
+ + + + + + +
+

+
+ + + + + +
+ + +
+
+ + +
+
+
0%
+
+
+
+
+
+ + + + + + \ No newline at end of file diff --git a/pmid_35559673_interactions_network.html b/pmid_35559673_interactions_network.html new file mode 100644 index 0000000000000000000000000000000000000000..797170cb8562bb4ad5ecca1380c3b53c7f54e377 --- /dev/null +++ b/pmid_35559673_interactions_network.html @@ -0,0 +1,14 @@ + + + +
+
+ + \ No newline at end of file diff --git a/winston_churchill_we_shall_fight_speech_june_1940_txt.html b/winston_churchill_we_shall_fight_speech_june_1940_txt.html new file mode 100644 index 0000000000000000000000000000000000000000..f5d6debeee44a85ebb4c171ee5823150cb9e2265 --- /dev/null +++ b/winston_churchill_we_shall_fight_speech_june_1940_txt.html @@ -0,0 +1,189 @@ + + + + + + + + + +
+

+
+ + + + + + +
+

+
+ + + + + +
+ + +
+
+ + + + + + +
+

Entity Types

+ +
+ + \ No newline at end of file