Spaces: derek-thomas/arabic-RAG


derek-thomas committed on Oct 28, 2023

Commit

70bad37

1 Parent(s): 6404d3b

Adding notebooks to iterate on, and cleaning other code


Files changed (10)

.gitignore +4 -1
data/consolidated/.gitkeep +0 -0
data/processed/.gitkeep +0 -0
data/raw/.gitkeep +0 -0
notebooks/01_get_data.ipynb +274 -0
notebooks/02_preprocessing.ipynb +359 -0
notebooks/03_get_embeddings.ipynb +441 -0
notebooks/04_vector_db.ipynb +241 -0
preprocess_wiki.py +0 -167
src/preprocessing/consolidate.py +85 -0

.gitignore CHANGED Viewed

@@ -1,4 +1,7 @@
 *.bz2
 *.gz
 output/
-.idea/

 *.bz2
 *.gz
 output/
+.idea/
+notebooks/.
+notebooks/.ipynb_checkpoints/*
+data/*/*

data/consolidated/.gitkeep ADDED Viewed

File without changes

data/processed/.gitkeep ADDED Viewed

File without changes

data/raw/.gitkeep ADDED Viewed

File without changes

notebooks/01_get_data.ipynb ADDED Viewed

	@@ -0,0 +1,274 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "883a8a6a-d0b5-40ea-90a0-5b33d3332360",
+   "metadata": {},
+   "source": [
+    "# Get Data\n",
+    "The data from Wikipedia starts as XML. This is a relatively simple way to format that into a single JSON file for our purposes."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a7d66da5-185c-409e-9568-f211ca4b725e",
+   "metadata": {},
+   "source": [
+    "## Initialize Variables"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "ea8ae64c-f597-4c94-b93d-1b78060d7953",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "import sys"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "id": "2f9527f9-4756-478b-99ac-a3c8c26ab63e",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "proj_dir_path = Path.cwd().parent\n",
+    "proj_dir = str(proj_dir_path)\n",
+    "\n",
+    "# So we can import later\n",
+    "sys.path.append(proj_dir)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "860da614-743b-4060-9d22-673896414cbd",
+   "metadata": {},
+   "source": [
+    "## Install Libraries"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "8bec29e3-8434-491f-914c-13f303dc68f3",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Note: you may need to restart the kernel to use updated packages.\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pip install -q -r \"$proj_dir\"/requirements.txt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b928c71f-7e34-47ee-b55e-aa12d5118ba7",
+   "metadata": {},
+   "source": [
+    "## Download Latest Arabic Wikipedia"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f1dc5f57-c877-43e3-8131-4f351b99168d",
+   "metadata": {},
+   "source": [
+    "I'm getting \"latest\", but it's good to see what version it is nonetheless."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "fe4b357f-88fe-44b5-9fce-354404b1447f",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Last-Modified: Sun, 01 Oct 2023 23:32:27 GMT\n"
+     ]
+    }
+   ],
+   "source": [
+    "!curl -I https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles-multistream.xml.bz2 --silent | grep \"Last-Modified\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fe62d4a3-b59b-40c4-9a8c-bf0a447a9ec2",
+   "metadata": {},
+   "source": [
+    "Download Arabic Wikipedia"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "0f309c12-12de-4460-a03f-bd5b6fcc942c",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "--2023-10-18 10:55:38--  https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles-multistream.xml.bz2\n",
+      "Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142\n",
+      "Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.\n",
+      "HTTP request sent, awaiting response... 200 OK\n",
+      "Length: 286759308 (273M) [application/octet-stream]\n",
+      "Saving to: ‘/home/ec2-user/RAGDemo/data/raw/simplewiki-latest-pages-articles-multistream.xml.bz2’\n",
+      "\n",
+      "100%[======================================>] 286,759,308 4.22MB/s   in 66s    \n",
+      "\n",
+      "2023-10-18 10:56:45 (4.13 MB/s) - ‘/home/ec2-user/RAGDemo/data/raw/simplewiki-latest-pages-articles-multistream.xml.bz2’ saved [286759308/286759308]\n"
+     ]
+    }
+   ],
+   "source": [
+    "!wget -nc -P \"$proj_dir\"/data/raw https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles-multistream.xml.bz2"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "46af5df6-5785-400a-986c-54a2c98768ea",
+   "metadata": {},
+   "source": [
+    "## Extract from XML\n",
+    "The download format from Wikipedia is XML. `wikiextractor` will convert this into a jsonl format split into many folders and files."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "c22dedcd-73b3-4aad-8eb7-1063954967ed",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "INFO: Preprocessing '/home/ec2-user/RAGDemo/data/raw/simplewiki-latest-pages-articles-multistream.xml.bz2' to collect template definitions: this may take some time.\n",
+      "INFO: Preprocessed 100000 pages\n",
+      "INFO: Preprocessed 200000 pages\n",
+      "INFO: Preprocessed 300000 pages\n",
+      "INFO: Preprocessed 400000 pages\n",
+      "INFO: Loaded 36594 templates in 54.1s\n",
+      "INFO: Starting page extraction from /home/ec2-user/RAGDemo/data/raw/simplewiki-latest-pages-articles-multistream.xml.bz2.\n",
+      "INFO: Using 3 extract processes.\n",
+      "INFO: Extracted 100000 articles (3481.4 art/s)\n",
+      "INFO: Extracted 200000 articles (3764.9 art/s)\n",
+      "INFO: Extracted 300000 articles (4175.8 art/s)\n",
+      "INFO: Finished 3-process extraction of 332024 articles in 86.9s (3822.7 art/s)\n"
+     ]
+    }
+   ],
+   "source": [
+    "!wikiextractor -o \"$proj_dir\"/data/raw/output  --json \"$proj_dir\"/data/raw/arwiki-latest-pages-articles-multistream.xml.bz2 "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bb8063c6-1bed-49f0-948a-eeb9a7933b4a",
+   "metadata": {},
+   "source": [
+    "## Consolidate into json\n",
+    "\n",
+    "The split format is tedious to deal with, so now we will consolidate this into 1 json file. This is fine since our data fits easily in RAM. But if it didn't, there are better options.\n",
+    "\n",
+    "Feel free to check out the [consolidate file](../src/preprocessing/consolidate.py) for more details."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "0a4ce3aa-9c1e-45e4-8219-a1714f482371",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from src.preprocessing.consolidate import folder_to_json"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "id": "3e93da6a-e304-450c-a81e-ffecaf0d8a9a",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "3f045c61ef544f34a1d6f7c4236b206c",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Processing:   0%|          | 0/206 [00:00<?, ?file/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Wiki processed in 2.92 seconds!\n",
+      "Writing file!\n",
+      "File written in 3.08 seconds!\n"
+     ]
+    }
+   ],
+   "source": [
+    "folder = proj_dir_path / 'data/raw/output'\n",
+    "folder_out = proj_dir_path / 'data/consolidated/'\n",
+    "folder_to_json(folder, folder_out, 'ar_wiki')"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
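Review note on the consolidation step in notebooks/01_get_data.ipynb: the real `folder_to_json` lives in src/preprocessing/consolidate.py, which is not shown in this part of the diff. The sketch below only illustrates the general idea of merging JSON-lines shards into one JSON list. The function name `consolidate_jsonl_folder` is hypothetical, and the `wiki_*` file pattern is an assumption about how `wikiextractor --json` names its output files.

```python
import json
from pathlib import Path


def consolidate_jsonl_folder(folder, out_file):
    """Merge every JSON-lines shard under `folder` into a single JSON list.

    Hypothetical helper for illustration; not the repository's folder_to_json.
    Returns the number of articles written.
    """
    articles = []
    # Assumption: shards are named wiki_00, wiki_01, ... inside subfolders.
    for path in sorted(Path(folder).rglob("wiki_*")):
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines between records
                    articles.append(json.loads(line))

    out_file = Path(out_file)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    with open(out_file, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Arabic text readable in the output file
        json.dump(articles, f, ensure_ascii=False)
    return len(articles)
```

Holding everything in one list works only while the corpus fits in RAM, as the notebook notes. For larger dumps, streaming each record straight to an output file would avoid building the list.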

notebooks/02_preprocessing.ipynb ADDED Viewed

	@@ -0,0 +1,359 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "b1b28232-b65d-41ce-88de-fd70b93a528d",
+   "metadata": {},
+   "source": [
+    "# Imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "abb5186b-ee67-4e1e-882d-3d8d5b4575d4",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "from pathlib import Path\n",
+    "import pickle\n",
+    "from tqdm.auto import tqdm\n",
+    "\n",
+    "from haystack.nodes.preprocessor import PreProcessor"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "c4b82ea2-8b30-4c2e-99f0-9a30f2f1bfb7",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/home/ec2-user/RAGDemo\n"
+     ]
+    }
+   ],
+   "source": [
+    "proj_dir = Path.cwd().parent\n",
+    "print(proj_dir)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "76119e74-f601-436d-a253-63c5a19d1c83",
+   "metadata": {},
+   "source": [
+    "# Config"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "f6f74545-54a7-4f41-9f02-96964e1417f0",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "files_in = list((proj_dir / 'data/consolidated').glob('*.ndjson'))\n",
+    "folder_out = proj_dir / 'data/processed'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6a643cf2-abce-48a9-b4e0-478bcbee28c3",
+   "metadata": {},
+   "source": [
+    "# Preprocessing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a8f9630e-447e-423e-9f6c-e1dbc654f2dd",
+   "metadata": {},
+   "source": [
+    "It's important to choose good pre-processing options. \n",
+    "\n",
+    "Cleaning whitespace helps each stage of RAG. Extra whitespace adds noise to the embeddings and wastes space when we prompt with it.\n",
+    "\n",
+    "I chose to split by word as it would be tedious to tokenize here, and that doesn't scale well. The context length for most embedding models ends up being 512 tokens. This is ~400 words. \n",
+    "\n",
+    "I like to respect the sentence boundary, which is why I set `split_length` to 350 and left a ~50 word buffer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "18807aea-24e4-4d74-bf10-55b24f3cb52c",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...\n",
+      "[nltk_data]   Unzipping tokenizers/punkt.zip.\n"
+     ]
+    }
+   ],
+   "source": [
+    "pp = PreProcessor(clean_whitespace = True,\n",
+    "             clean_header_footer = False,\n",
+    "             clean_empty_lines = True,\n",
+    "             remove_substrings = None,\n",
+    "             split_by='word',\n",
+    "             split_length = 350,\n",
+    "             split_overlap = 50,\n",
+    "             split_respect_sentence_boundary = True,\n",
+    "             tokenizer_model_folder = None,\n",
+    "             language = \"en\",\n",
+    "             id_hash_keys = None,\n",
+    "             progress_bar = True,\n",
+    "             add_page_number = False,\n",
+    "             max_chars_check = 10_000)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "dab1658a-79a7-40f2-9a8c-1798e0d124bf",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "with open(file_in, 'r', encoding='utf-8') as f:\n",
+    "    list_of_articles = json.load(f)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "4ca6e576-4b7d-4c1a-916f-41d1b82be647",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Preprocessing:   0%|▌                                                                                                                      | 1551/332023 [00:02<09:44, 565.82docs/s]We found one or more sentences whose word count is higher than the split length.\n",
+      "Preprocessing:  83%|████████████████████████████████████████████████████████████████████████████████████████████████▌                   | 276427/332023 [02:12<00:20, 2652.57docs/s]Document 81972e5bc1997b1ed4fb86d17f061a41 is 21206 characters long after preprocessing, where the maximum length should be 10000. Something might be wrong with the splitting, check the document affected to prevent issues at query time. This document will be now hard-split at 10000 chars recursively.\n",
+      "Document 5e63e848e42966ddc747257fb7cf4092 is 11206 characters long after preprocessing, where the maximum length should be 10000. Something might be wrong with the splitting, check the document affected to prevent issues at query time. This document will be now hard-split at 10000 chars recursively.\n",
+      "Preprocessing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 332023/332023 [02:29<00:00, 2219.16docs/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "documents = pp.process(list_of_articles)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f00dbdb2-906f-4d5a-a3f1-b0d84385d85a",
+   "metadata": {},
+   "source": [
+    "When we break a Wikipedia article up, we lose some of the context. The local context is somewhat preserved by the `split_overlap`. I'm trying to preserve the global context by adding a prefix that has the article's title.\n",
+    "\n",
+    "You could enhance this with the summary as well. This is mostly to help the retrieval step of RAG. Note that the way I'm doing it alters some of `haystack`'s features like the hash and the lengths, but those aren't too necessary. \n",
+    "\n",
+    "A more advanced way for many business applications would be to summarize the document and add that as a prefix for sub-documents.\n",
+    "\n",
+    "One last thing to note is that it would be prudent (in some use-cases) to preserve the original document without the summary to give to the reader (retrieve with the summary but prompt without), but since this is a simple use-case I won't be doing that."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "076e115d-3e88-49d2-bc5d-f725a94e4964",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "ba764e7bf29f4202a74e08576a29f4e4",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "  0%|          | 0/268980 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "# Prefix every split after the first with its article title\n",
+    "for document in tqdm(documents):\n",
+    "    if document.meta['_split_id'] != 0:\n",
+    "        document.content = f'Title: {document.meta[\"title\"]}. ' + document.content"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "72c1849c-1f4d-411f-b74b-6208b1e48217",
+   "metadata": {},
+   "source": [
+    "## Pre-processing Examples"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "02c1c6c8-6283-49a8-9d29-c355f1b08540",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<Document: {'content': \"April (Apr.) is the fourth month of the year in the Julian and Gregorian calendars, and comes between March and May. It is one of the four months to have 30 days.\\nApril always begins on the same day of the week as July, and additionally, January in leap years. April always ends on the same day of the week as December.\\nThe Month.\\nApril comes between March and May, making it the fourth month of the year. It also comes first in the year out of the four months that have 30 days, as June, September and November are later in the year.\\nApril begins on the same day of the week as July every year and on the same day of the week as January in leap years. April ends on the same day of the week as December every year, as each other's last days are exactly 35 weeks (245 days) apart.\\nIn common years, April starts on the same day of the week as October of the previous year, and in leap years, May of the previous year. In common years, April finishes on the same day of the week as July of the previous year, and in leap years, February and October of the previous year. In common years immediately after other common years, April starts on the same day of the week as January of the previous year, and in leap years and years immediately after that, April finishes on the same day of the week as January of the previous year.\\nIn years immediately before common years, April starts on the same day of the week as September and December of the following year, and in years immediately before leap years, June of the following year. In years immediately before common years, April finishes on the same day of the week as September of the following year, and in years immediately before leap years, March and June of the following year.\\nApril is a spring month in the Northern Hemisphere and an autumn/fall month in the Southern Hemisphere. 
\", 'content_type': 'text', 'score': None, 'meta': {'id': '1', 'revid': '9086769', 'url': 'https://simple.wikipedia.org/wiki?curid=1', 'title': 'April', '_split_id': 0, '_split_overlap': [{'doc_id': '79a74c1e6444dd0a1acd72840e9dd7c0', 'range': (1529, 1835)}]}, 'id_hash_keys': ['content'], 'embedding': None, 'id': 'a1c2acf337dbc3baa6f7f58403dfb95d'}>"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "documents[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "b34890bf-9dba-459a-9b0d-aa4b5929cbe8",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<Document: {'content': 'Title: April. In years immediately before common years, April finishes on the same day of the week as September of the following year, and in years immediately before leap years, March and June of the following year.\\nApril is a spring month in the Northern Hemisphere and an autumn/fall month in the Southern Hemisphere. In each hemisphere, it is the seasonal equivalent of October in the other.\\nIt is unclear as to where April got its name. A common theory is that it comes from the Latin word \"aperire\", meaning \"to open\", referring to flowers opening in spring. Another theory is that the name could come from Aphrodite, the Greek goddess of love. It was originally the second month in the old Roman Calendar, before the start of the new year was put to January 1.\\nQuite a few festivals are held in this month. In many Southeast Asian cultures, new year is celebrated in this month (including Songkran). In Western Christianity, Easter can be celebrated on a Sunday between March 22 and April 25. In Orthodox Christianity, it can fall between April 4 and May 8. At the end of the month, Central and Northern European cultures celebrate Walpurgis Night on April 30, marking the transition from winter into summer.\\nApril in poetry.\\nPoets use \"April\" to mean the end of winter. For example: \"April showers bring May flowers.\"', 'content_type': 'text', 'score': None, 'meta': {'id': '1', 'revid': '9086769', 'url': 'https://simple.wikipedia.org/wiki?curid=1', 'title': 'April', '_split_id': 1, '_split_overlap': [{'doc_id': 'a1c2acf337dbc3baa6f7f58403dfb95d', 'range': (0, 306)}]}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '79a74c1e6444dd0a1acd72840e9dd7c0'}>"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "documents[1]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "e6f50c27-a486-47e9-ba60-d567f5e530db",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<Document: {'content': 'Title: Chief Joseph. He knew he could not trust them anymore. He was tired of being considered a savage. He felt it was not fair for people who were born on the same land to be treated differently. He delivered a lot of speeches on this subject, which are still really good examples of eloquence. But he did not feel listened to, and when he died in his reservation in 1904, the doctor said he \"died from sadness\". He was buried in Colville Native American Burial Ground, in Washington State.', 'content_type': 'text', 'score': None, 'meta': {'id': '19310', 'revid': '16695', 'url': 'https://simple.wikipedia.org/wiki?curid=19310', 'title': 'Chief Joseph', '_split_id': 1, '_split_overlap': [{'doc_id': '4bdf9cecd46c3bfac6b225aed940e798', 'range': (0, 275)}]}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '91bc8240c5d067ab24f35c11f8916fc6'}>"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "documents[10102]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "5485cc27-3d3f-4b96-8884-accf5324da2d",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Number of Articles: 332023\n",
+      "Number of processed articles: 237724\n",
+      "Number of processed documents: 268980\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(f'Number of Articles: {len(list_of_articles)}')\n",
+    "processed_articles = len([d for d in documents if d.meta['_split_id'] == 0])\n",
+    "print(f'Number of processed articles: {processed_articles}')\n",
+    "print(f'Number of processed documents: {len(documents)}')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "23ce57a8-d14e-426d-abc2-0ce5cdbc881a",
+   "metadata": {},
+   "source": [
+    "# Write to file"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "0d044870-7a30-4e09-aad2-42f24a52780d",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "with open(file_out, 'wb') as handle:\n",
+    "    pickle.dump(documents, handle, protocol=pickle.HIGHEST_PROTOCOL)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c5833dba-1bf6-48aa-be6f-0d70c71e54aa",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
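Review note on the splitting strategy in notebooks/02_preprocessing.ipynb: the notebook relies on haystack's `PreProcessor` with `split_length = 350` and `split_overlap = 50`, then prefixes every split after the first with the article title. The sketch below shows the same sliding-window idea in plain Python. It is a simplification and not haystack's implementation: it does not respect sentence boundaries, and the names `split_words` and `prefix_title` are hypothetical.

```python
def split_words(text, split_length=350, split_overlap=50):
    """Split text into word windows that overlap by `split_overlap` words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once a window has reached the end of the text
        if start + split_length >= len(words):
            break
    return chunks


def prefix_title(chunks, title):
    """Prepend the title to every chunk except the first, as the notebook does."""
    return [
        chunk if i == 0 else f"Title: {title}. {chunk}"
        for i, chunk in enumerate(chunks)
    ]
```

With these defaults each window advances by 300 words, so consecutive chunks share 50 words of local context while the title prefix carries the global context.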

notebooks/03_get_embeddings.ipynb ADDED Viewed

	@@ -0,0 +1,441 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "a0f21cb1-fbc8-4282-b902-f47d92974df8",
+   "metadata": {},
+   "source": [
+    "# Pre-requisites"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5f625807-0707-4e2f-a0e0-8fbcdf08c865",
+   "metadata": {},
+   "source": [
+    "## Why TEI\n",
+    "There are 2 **unsung** challenges with RAG at scale:\n",
+    "1. Getting the embeddings efficiently\n",
+    "1. Efficient ingestion into the vector DB\n",
+    "\n",
+    "The issue with `1.` is that there are techniques but they are not widely *applied*. TEI solves a number of aspects:\n",
+    "- Token Based Dynamic Batching\n",
+    "- Using latest optimizations (Flash Attention, Candle and cuBLASLt)\n",
+    "- Fast loading with safetensors\n",
+    "\n",
+    "The issue with `2.` is that it takes a bit of planning. We won't go much into that side of things here though."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3102abce-ea42-4da6-8c98-c6dd4edf7f0b",
+   "metadata": {},
+   "source": [
+    "## Start TEI\n",
+    "Run [TEI](https://github.com/huggingface/text-embeddings-inference#docker). I have this running in an nvidia-docker container, but you can install it as you like. Note that I ran this in a different terminal for monitoring and separation. \n",
+    "\n",
+    "Note that the command below always pulls the latest image (`--pull always`). TEI is at a very early stage at the time of writing. \n",
+    "\n",
+    "I chose the smaller [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) instead of the large. It's just as good on [mteb/leaderboard](https://huggingface.co/spaces/mteb/leaderboard), but it's faster and smaller. TEI is fast, but this will make our life easier for storage and retrieval.\n",
+    "\n",
+    "I use the `revision=refs/pr/1` because this has the pull request with [safetensors](https://github.com/huggingface/safetensors) which is required by TEI. Check out the [pull request](https://huggingface.co/BAAI/bge-base-en-v1.5/discussions/1) if you want to use a different embedding model and it doesn't have safetensors."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "7e873652-8257-4aae-92bc-94e1bac54b73",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "# volume=$PWD/data\n",
+    "# model=BAAI/bge-base-en-v1.5\n",
+    "# revision=refs/pr/1\n",
+    "# docker run \\\n",
+    "#     --gpus all \\\n",
+    "#     -p 8080:80 \\\n",
+    "#     -v $volume:/data \\\n",
+    "#     --pull always \\\n",
+    "#     ghcr.io/huggingface/text-embeddings-inference:latest \\\n",
+    "#     --model-id $model \\\n",
+    "#     --revision $revision \\\n",
+    "#     --pooling cls \\\n",
+    "#     --max-batch-tokens 65536"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "86a5ff83-1038-4880-8c90-dc3cab75cb49",
+   "metadata": {},
+   "source": [
+    "## Test Endpoint"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "52edfc97-5b6f-44f9-8d89-8578cf79fae9",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "passed\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%bash\n",
+    "\n",
+    "response_code=$(curl -s -o /dev/null -w \"%{http_code}\" 127.0.0.1:8080/embed \\\n",
+    "    -X POST \\\n",
+    "    -d '{\"inputs\":\"What is Deep Learning?\"}' \\\n",
+    "    -H 'Content-Type: application/json')\n",
+    "\n",
+    "if [ \"$response_code\" -eq 200 ]; then\n",
+    "    echo \"passed\"\n",
+    "else\n",
+    "    echo \"failed\"\n",
+    "fi"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b1b28232-b65d-41ce-88de-fd70b93a528d",
+   "metadata": {},
+   "source": [
+    "# Imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "88408486-566a-4791-8ef2-5ee3e6941156",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from IPython.core.interactiveshell import InteractiveShell\n",
+    "InteractiveShell.ast_node_interactivity = 'all'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "abb5186b-ee67-4e1e-882d-3d8d5b4575d4",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import asyncio\n",
+    "from pathlib import Path\n",
+    "import pickle\n",
+    "\n",
+    "import aiohttp\n",
+    "from tqdm.notebook import tqdm"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "c4b82ea2-8b30-4c2e-99f0-9a30f2f1bfb7",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/home/ec2-user/RAGDemo\n"
+     ]
+    }
+   ],
+   "source": [
+    "proj_dir = Path.cwd().parent\n",
+    "print(proj_dir)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "76119e74-f601-436d-a253-63c5a19d1c83",
+   "metadata": {},
+   "source": [
+    "# Config"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0d2bcda7-b245-45e3-a347-34166f217e1e",
+   "metadata": {},
+   "source": [
+    "I'm putting the documents in pickle files. The compression is nice, though it's important to note that pickles are a known security risk: only load pickle files you trust."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "f6f74545-54a7-4f41-9f02-96964e1417f0",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "file_in = proj_dir / 'data/processed/simple_wiki_processed.pkl'\n",
+    "file_out = proj_dir / 'data/processed/simple_wiki_embeddings.pkl'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d2dd0df0-4274-45b3-9ee5-0205494e4d75",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Setup\n",
+    "Read in our list of documents and convert them to dictionaries for processing."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "3c08e039-3686-4eca-9f87-7c469e3f19bc",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 6.24 s, sys: 928 ms, total: 7.17 s\n",
+      "Wall time: 6.61 s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "with open(file_in, 'rb') as handle:\n",
+    "    documents = pickle.load(handle)\n",
+    "\n",
+    "documents = [document.to_dict() for document in documents]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5e73235d-6274-4958-9e57-977afeeb5f1b",
+   "metadata": {},
+   "source": [
+    "# Embed\n",
+    "## Strategy\n",
+    "TEI allows multiple concurrent requests, so it's important that we don't waste the potential we have. I used the default `max-concurrent-requests` value of `512`, so I want to use that many `MAX_WORKERS`.\n",
+    "\n",
+    "I'm using an `async` way of making requests that uses `aiohttp` as well as a nice progress bar. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "949d6bf8-804f-496b-a59a-834483cc7073",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# Constants\n",
+    "ENDPOINT = \"http://127.0.0.1:8080/embed\"\n",
+    "HEADERS = {'Content-Type': 'application/json'}\n",
+    "MAX_WORKERS = 512"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cf3da8cc-1651-4704-9091-39c2a1b835be",
+   "metadata": {},
+   "source": [
+    "Note that I'm using `'truncate':True` as even with our `350` word split earlier, there are always exceptions. It's important that as this scales we have as few issues as possible when embedding. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "3353c849-a36c-4047-bb81-93dac6c49b68",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "async def fetch(session, url, document):\n",
+    "    payload = {\"inputs\": [document[\"content\"]], 'truncate':True}\n",
+    "    async with session.post(url, json=payload) as response:\n",
+    "        if response.status == 200:\n",
+    "            resp_json = await response.json()\n",
+    "            # The response is a list of embeddings, one per input; we sent a single input\n",
+    "            document[\"embedding\"] = resp_json[0]\n",
+    "        else:\n",
+    "            print(f\"Error {response.status}: {await response.text()}\")\n",
+    "            # Handle error appropriately if needed\n",
+    "\n",
+    "async def main(documents):\n",
+    "    async with aiohttp.ClientSession(headers=HEADERS) as session:\n",
+    "        tasks = [fetch(session, ENDPOINT, doc) for doc in documents]\n",
+    "        await asyncio.gather(*tasks)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "f0d17264-72dc-40be-aa46-17cde38c8189",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "f0ff772e915f4432971317e2150b60f2",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Processing documents:   0%|          | 0/526 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "# Create a list of async tasks\n",
+    "tasks = [main(documents[i:i+MAX_WORKERS]) for i in range(0, len(documents), MAX_WORKERS)]\n",
+    "\n",
+    "# Add a progress bar for visual feedback and run tasks\n",
+    "for task in tqdm(tasks, desc=\"Processing documents\"):\n",
+    "    await task"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f90a0ed7-b5e9-4ae4-9e87-4c04875ebcc9",
+   "metadata": {},
+   "source": [
+    "Let's double-check that we got all the embeddings we expected!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "3950fa88-9961-4b33-9719-d5804509d4cf",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "268980"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "268980"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "count = 0\n",
+    "for document in documents:\n",
+    "    if len(document.get('embedding', [])) == 768:\n",
+    "        count += 1\n",
+    "count\n",
+    "len(documents)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5b78bfa4-d365-4906-a71c-f444eabf6bf8",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "Great, we can see that they match.\n",
+    "\n",
+    "Let's write our embeddings to a file."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "58d437a5-473f-4eae-9dbf-e8e6992754f6",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 5.68 s, sys: 640 ms, total: 6.32 s\n",
+      "Wall time: 14.1 s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "with open(file_out, 'wb') as handle:\n",
+    "    pickle.dump(documents, handle, protocol=pickle.HIGHEST_PROTOCOL)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fc1e7cc5-b878-42bb-9fb4-e810f3f5006a",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Next Steps\n",
+    "We need to import this into a vector db. "
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
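
A note on the embedding loop in `03_get_embeddings.ipynb`: the notebook slices `documents` into groups of `MAX_WORKERS` and awaits each group in turn, so one slow request holds up the whole group. A minimal sketch of an alternative, assuming an `asyncio.Semaphore` is acceptable here, keeps up to `limit` requests in flight continuously. `fake_embed` and `embed_all` are hypothetical names that are not part of the commit; `fake_embed` stands in for the `aiohttp` call so the sketch runs without a server.

```python
import asyncio

MAX_WORKERS = 512  # same value as the notebook's constant


async def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for the HTTP request to the embedding server."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [float(len(text))]


async def embed_all(documents, embed=fake_embed, limit=MAX_WORKERS):
    """Attach an 'embedding' to every document with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(document):
        # The semaphore blocks here once `limit` workers are already running.
        async with semaphore:
            document["embedding"] = await embed(document["content"])

    await asyncio.gather(*(worker(doc) for doc in documents))
    return documents


if __name__ == "__main__":
    docs = [{"content": "abc"}, {"content": "hello"}]
    print(asyncio.run(embed_all(docs, limit=1)))
```

Inside a notebook, where an event loop is already running, the call would be `await embed_all(documents)` rather than `asyncio.run(...)`.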

notebooks/04_vector_db.ipynb ADDED Viewed

	@@ -0,0 +1,241 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6a151ade-7d86-4a2e-bfe7-462089f4e04c",
+   "metadata": {},
+   "source": [
+    "# Approach\n",
+    "Several aspects of choosing a vector db might be unique to your situation. You should think through your hardware, utilization, latency requirements, scale, etc. before choosing. \n",
+    "\n",
+    "I'm targeting a demo (low utilization, relaxed latency) that will live on a Hugging Face Space. The data is small enough that it could even fit in memory. I like [Qdrant](https://qdrant.tech) for this. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b1b28232-b65d-41ce-88de-fd70b93a528d",
+   "metadata": {},
+   "source": [
+    "# Imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "88408486-566a-4791-8ef2-5ee3e6941156",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from IPython.core.interactiveshell import InteractiveShell\n",
+    "InteractiveShell.ast_node_interactivity = 'all'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "abb5186b-ee67-4e1e-882d-3d8d5b4575d4",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "import pickle\n",
+    "\n",
+    "from tqdm.notebook import tqdm\n",
+    "from haystack.schema import Document\n",
+    "from qdrant_haystack import QdrantDocumentStore"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "c4b82ea2-8b30-4c2e-99f0-9a30f2f1bfb7",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/home/ec2-user/RAGDemo\n"
+     ]
+    }
+   ],
+   "source": [
+    "proj_dir = Path.cwd().parent\n",
+    "print(proj_dir)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "76119e74-f601-436d-a253-63c5a19d1c83",
+   "metadata": {},
+   "source": [
+    "# Config"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "f6f74545-54a7-4f41-9f02-96964e1417f0",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "file_in = proj_dir / 'data/processed/simple_wiki_embeddings.pkl'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d2dd0df0-4274-45b3-9ee5-0205494e4d75",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Setup\n",
+    "Read in our list of dictionaries. This takes ~10GB of RAM, which is the upper end for the machine I'm using. We could easily do this in batches of ~100k and be fine on most machines. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "3c08e039-3686-4eca-9f87-7c469e3f19bc",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 11.6 s, sys: 2.25 s, total: 13.9 s\n",
+      "Wall time: 18.1 s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "with open(file_in, 'rb') as handle:\n",
+    "    documents = pickle.load(handle)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "98aec715-8d97-439e-99c0-0eff63df386b",
+   "metadata": {},
+   "source": [
+    "Convert the dictionaries to `Documents`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "4821e3c1-697d-4b69-bae3-300168755df9",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "documents = [Document.from_dict(d) for d in documents]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "676f644c-fb09-4d17-89ba-30c92aad8777",
+   "metadata": {},
+   "source": [
+    "Instantiate our `DocumentStore`. Note that I'm saving this to disk for portability, since I want to move it from this EC2 instance into a Hugging Face Space. \n",
+    "\n",
+    "Note that if you are doing this at scale, you should use a proper Qdrant instance rather than saving to a file. You should also take a [measured ingestion](https://qdrant.tech/documentation/tutorials/bulk-upload/) approach instead of using a convenient loader. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "e51b6e19-3be8-4cb0-8b65-9d6f6121f660",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "document_store = QdrantDocumentStore(\n",
+    "    path=str(proj_dir/'Qdrant'),\n",
+    "    index=\"RAGDemo\",\n",
+    "    embedding_dim=768,\n",
+    "    recreate_index=True,\n",
+    "    hnsw_config={\"m\": 16, \"ef_construct\": 64}  # Optional\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "55fbcd5d-922c-4e93-a37a-974ba84464ac",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "270000it [28:43, 156.68it/s]                                                                                                          "
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 13min 23s, sys: 48.6 s, total: 14min 12s\n",
+      "Wall time: 28min 43s\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "document_store.write_documents(documents, batch_size=5_000)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9a073815-0191-48f7-890f-a4e4ecc0f9f1",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
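
The Setup cell in `04_vector_db.ipynb` mentions that the load could be done in batches of ~100k. A minimal sketch of that batching, under the assumption that the records arrive as an iterable rather than one in-memory list (a single pickle still has to be loaded whole). `iter_batches` and `ingest` are hypothetical helpers, and `write_batch` is a placeholder for whatever writes a batch to the store.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_batches(items: Iterable, size: int) -> Iterator[List]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls the next `size` items; an empty list means we are done.
    while batch := list(islice(iterator, size)):
        yield batch


def ingest(items: Iterable, write_batch: Callable[[List], None], size: int = 100_000) -> int:
    """Write `items` batch by batch and return how many were written."""
    written = 0
    for batch in iter_batches(items, size):
        write_batch(batch)  # placeholder for the real store write
        written += len(batch)
    return written
```

Only one batch is held in memory at a time, so peak usage depends on `size` rather than on the total number of records.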

preprocess_wiki.py DELETED Viewed

@@ -1,167 +0,0 @@
-import os
-import json
-from pathlib import Path
-from tqdm.auto import tqdm
-from typing import List, Any, Dict
-MAX_WORDS = 250
-def folder_to_json(folder_in: Path, json_path: Path) -> List[Any]:
-    """
-    Process JSON lines from files in a given folder and write processed data to a new JSON file.
-    Parameters:
-    folder_in (Path): Path to the input folder containing the JSON files to process.
-    json_path (Path): Path to the output JSON file where the processed data will be written.
-    Returns:
-    List[Any]: List containing processed JSON data from all files in the input folder.
-    Example:
-    folder_to_json(Path("/path/to/input/folder"), Path("/path/to/output.json"))
-    """
-    folder_in = Path(folder_in)
-    json_out = []  # Initialize list to hold processed JSON data from all files
-    # Calculate total number of files in the input folder to set up the progress bar
-    total_files = sum([len(files) for r, d, files in os.walk(folder_in)])
-    # Initialize progress bar with total file count, description, and unit of progress
-    with tqdm(total=total_files, desc='Processing', unit='file') as pbar:
-        # Iterate through all files in the input folder
-        for subdir, _, files in os.walk(folder_in):
-            # Set progress bar postfix to display current directory
-            pbar.set_postfix_str(f"Directory: {subdir}", refresh=False)
-            for file in files:
-                # Update progress bar postfix to display current file and directory
-                pbar.set_postfix_str(f"Dir: {subdir} | File: {file}", refresh=True)
-                # Create full file path for the current file
-                file_path = Path(subdir) / file
-                # Open and read the current file
-                with open(file_path, 'r', encoding='utf-8') as f:
-                    for line in f:
-                        # Load JSON data from each line and process it
-                        article = json.loads(line)
-                        # Ensure the preprocess function is defined and accessible
-                        processed_article = preprocess(article)
-                        # Add processed data to the output list
-                        json_out.extend(processed_article)
-                # Update progress bar after processing each file
-                pbar.update(1)
-    # Notify that the writing process is starting
-    pbar.write("Writing file!")
-    # Open the output file and write the processed data as JSON
-    with open(json_path, "w", encoding='utf-8') as outfile:
-        json.dump(json_out, outfile)
-    # Notify that the writing process is complete
-    pbar.write("File written!")
-    # Return the list of processed data
-    return json_out
-def preprocess(article: Dict[str, Any]) -> List[Dict[str, Any]]:
-    """
-    Preprocess a given article dictionary, extracting and processing the 'text' field. Because of the `break` introduced
-    we are only taking the first chunk
-    Parameters:
-    article (Dict[str, Any]): Input dictionary containing an article. Expected to have a 'text' field.
-    Returns:
-    List[Dict[str, Any]]: A list of dictionaries, where each dictionary represents a preprocessed chunk of
-                          the original article's text. Each dictionary also contains the original article's
-                          fields (excluding 'text'), with an additional 'chunk_number' field indicating the
-                          order of the chunk.
-    Example:
-    article = {"text": "Example text", "title": "Example Title", "author": "John Doe"}
-    processed = preprocess(article)
-    print(processed)
-    """
-    # Create a new dictionary excluding the 'text' field from the original article
-    article_out = {k: v for k, v in article.items() if k != 'text'}
-    # Create a prefix using the article's text. Adjust this line as needed based on the actual structure of 'article'
-    prefix = f'عنوان: {article["text"]}. '
-    out = []  # Initialize the list to hold the preprocessed chunks
-    # Iterate over chunks obtained by splitting the article's text using the group_arabic_paragraphs function
-    # Ensure group_arabic_paragraphs is defined and accessible
-    for i, chunk in enumerate(group_arabic_paragraphs(article['text'], MAX_WORDS)):
-        # Concatenate the prefix with the current chunk
-        chunk = prefix + chunk
-        # Create a new dictionary with the chunk, original article fields (excluding 'text'), and chunk number
-        # Then append this dictionary to the 'out' list
-        out.append({'chunk': chunk, **article_out, 'chunk_number': i})
-        # Only take the first chunk
-        break
-    # Return the list of preprocessed chunks
-    return out
-def group_arabic_paragraphs(arabic_text: str, max_words: int) -> List[str]:
-    """
-    Group contiguous paragraphs of Arabic text without exceeding the max_words limit per group.
-    Parameters:
-    arabic_text (str): The input Arabic text where paragraphs are separated by newlines.
-    max_words (int): The maximum number of words allowed per group of paragraphs.
-    Returns:
-    List[str]: A list of strings where each string is a group of contiguous paragraphs.
-    Example:
-    arabic_text = "Paragraph1.\nParagraph2.\nParagraph3."
-    max_words = 5
-    result = group_arabic_paragraphs(arabic_text, max_words)
-    print(result)  # Output will depend on word count of each paragraph and max_words.
-    """
-    # Splitting the input text into paragraphs using newline as a delimiter
-    paragraphs = arabic_text.split('\n')
-    # Initialize variables to hold the grouped paragraphs and word count
-    grouped_paragraphs = []
-    current_group = []
-    current_word_count = 0
-    # Iterate through each paragraph in the input text
-    for paragraph in paragraphs:
-        # Count the number of words in the paragraph
-        word_count = len(paragraph.split())
-        # If adding the paragraph won't exceed the word limit, add it to the current group
-        if current_word_count + word_count <= max_words:
-            current_group.append(paragraph)
-            current_word_count += word_count  # Update the word count for the current group
-        else:
-            # If the paragraph exceeds the word limit, start a new group
-            if current_group:
-                grouped_paragraphs.append('\n'.join(current_group))
-            # Initialize a new group with the current paragraph
-            current_group = [paragraph]
-            current_word_count = word_count  # Reset the word count for the new group
-    # Add the last group if not empty
-    if current_group:
-        grouped_paragraphs.append('\n'.join(current_group))
-    # Return the grouped paragraphs as a list of strings
-    return grouped_paragraphs
-if __name__ == '__main__':
-    folder = Path('output')
-    file_out = Path('arwiki.json')
-    folder_to_json(folder, file_out)
-    print('Done!')

src/preprocessing/consolidate.py ADDED Viewed

	@@ -0,0 +1,85 @@

+import json
+from pathlib import Path
+from time import perf_counter
+from typing import Any, Dict
+from tqdm.auto import tqdm
+def folder_to_json(folder_in: Path, folder_out: Path, json_file_name: str):
+    """
+    Process JSON lines from files in a given folder and write processed data to new ndjson files.
+    Parameters:
+    folder_in (Path): Path to the input folder containing the JSON files to process.
+    folder_out (Path): Path to the output folder for processed ndjson
+    json_file_name (str): Base name for the output files. The files will be named
+                           {json_file_name}_1.ndjson, {json_file_name}_2.ndjson, and so on.
+    Example:
+    folder_to_json(Path("/path/to/input/folder"), Path("/path/to/output/folder"), "ar_wiki")
+    """
+    json_out = []  # Initialize list to hold processed JSON data from all files
+    file_counter = 1  # Counter to increment file names
+    process_start = perf_counter()
+    all_files = sorted(folder_in.rglob('*wiki*'), key=lambda x: str(x))
+    with tqdm(total=len(all_files), desc='Processing', unit='file') as pbar:
+        for file_path in all_files:
+            pbar.set_postfix_str(f"File: {file_path.name} | Dir: {file_path.parent}", refresh=True)
+            with open(file_path, 'r', encoding='utf-8') as f:
+                for line in f:
+                    article = json.loads(line)
+                    json_out.append(restructure_articles(article))
+                    # If size of json_out is 100,000, dump to file and clear list
+                    if len(json_out) == 100_000:
+                        append_to_file(json_out, folder_out / f"{json_file_name}_{file_counter}.ndjson")
+                        json_out.clear()
+                        file_counter += 1
+            pbar.update(1)
+    if json_out:  # Dump any remaining items in json_out to file
+        append_to_file(json_out, folder_out / f"{json_file_name}_{file_counter}.ndjson")
+    time_taken_to_process = perf_counter() - process_start
+    pbar.write(f"Wiki processed in {round(time_taken_to_process, 2)} seconds!")
+def append_to_file(data: list, path: Path):
+    with open(path, 'w', encoding='utf-8') as outfile:
+        for item in data:
+            json.dump(item, outfile)
+            outfile.write('\n')
+def restructure_articles(article: Dict[str, Any]) -> Dict[str, Any]:
+    """
+    Restructures the given article into haystack's format, separating content and meta data.
+    Args:
+    - article (Dict[str, Any]): The article to restructure.
+    Returns:
+    - Dict[str, Any]: The restructured article.
+    """
+    # Extract content and separate meta data
+    article_out = {
+        'content': article['text'],
+        'meta': {k: v for k, v in article.items() if k != 'text'}
+        }
+    return article_out
+if __name__ == '__main__':
+    proj_dir = Path(__file__).parents[2]
+    folder = proj_dir / 'data/raw/output'
+    folder_out = proj_dir / 'data/consolidated'
+    folder_to_json(folder, folder_out, 'ar_wiki')
+    print('Done!')
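
For reference, `consolidate.py` writes one JSON object per line into numbered `.ndjson` shards. A minimal sketch of reading those shards back, assuming they all sit in one folder; `iter_ndjson` is a hypothetical helper that is not part of this commit.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_ndjson(folder: Path, pattern: str = "*.ndjson") -> Iterator[Dict[str, Any]]:
    """Yield one record per non-empty line from every matching file in `folder`."""
    # Note: sorting is lexicographic, so "_10" comes before "_2".
    for path in sorted(folder.glob(pattern)):
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    yield json.loads(line)
```

Because this is a generator, records can be streamed into later steps without loading every shard at once.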