import pandas as pd
import streamlit as st

st.set_page_config(
    page_title="Implementation",
    page_icon="⚙️",
)

st.markdown("## What's under the hood? ⚙️")
st.markdown(
    """
My Notion Companion is an LLM-powered conversational RAG for chatting with documents from Notion.
It uses hybrid (lexical + semantic) search to find the relevant documents and a chat interface to interact with the docs.
It uses only **open-source technologies** and can **run on a single Mac Mini**.
Empowering technologies:
- **The Framework**: uses [Langchain](https://python.langchain.com/docs/)
- **The LLM**: uses the 🤗-developed [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), which offers great inference speed plus bilingual and instruction-following capabilities
- **The Datastores**: documents are stored both in a conventional lexical form and in an embedding-based vectorstore (uses [Redis](https://python.langchain.com/docs/integrations/vectorstores/redis))
- **The Embedding Model**: uses [`sentence-transformers/distiluse-base-multilingual-cased-v1`](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1), which offers great inference speed and bilingual capability
- **The Tokenizers**: uses 🤗's `AutoTokenizer` and the Chinese text segmentation tool [`jieba`](https://github.com/fxsjy/jieba) (only in lexical search)
- **The Lexical Search Tool**: uses [`rank_bm25`](https://github.com/dorianbrown/rank_bm25)
- **The Computing**: uses [LlamaCpp](https://github.com/ggerganov/llama.cpp) to power the LLM on the local machine (a Mac Mini with an M2 Pro chip)
- **The Observability Tool**: uses [LangSmith](https://docs.smith.langchain.com/)
- **The UI**: uses [Streamlit](https://docs.streamlit.io/)
"""
)
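st.markdown(
    """
For illustration, a minimal sketch of how these pieces could wire together in Langchain (the GGUF path, Redis URL, and index name are placeholders, not the exact values used in this project):
```
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import Redis

# local zephyr-7b-beta, fully offloaded to the Mac Mini's GPU
llm = LlamaCpp(
    model_path="models/zephyr-7b-beta.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to the GPU (Metal)
    n_ctx=4096,
)

# bilingual embedding model that backs the semantic search
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/distiluse-base-multilingual-cased-v1"
)

# Redis vectorstore holding the embedded documents
vectorstore = Redis(
    redis_url="redis://localhost:6379",
    index_name="notion-docs",  # placeholder index name
    embedding=embeddings,
)
```
"""
)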
st.markdown(
    """
#### The E2E Pipeline
- When a user enters a prompt, the assistant tries lexical search first
    - a query analyzer parses the query and extracts keywords (for search) and domains (for metadata filtering)
    - the extracted domains are compared against the documents' metadata; only documents with matching metadata are retrieved
    - the keywords are segmented into searchable tokens, then matched against the metadata-filtered documents with the BM25 lexical search algorithm
- The fetched documents pass through a final match checker to ensure relevance
- If lexical search doesn't return enough documents, the assistant falls back to semantic search against the Redis vectorstore; retrieved docs are also subject to QA by the match checker
- All retrieved documents are sent to the LLM as part of a system prompt; the LLM then acts as a conversational RAG, chatting with the user with knowledge from the provided documents

The flowchart below visualizes the workflow, followed by a simplified sketch of the retrieval routing logic.
"""
)
st.image("resources/flowchart.png", caption="E2E workflow")
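st.markdown(
    """
A minimal sketch of the retrieval routing described above (`query_analyzer`, `bm25_search`, `match_checker`, `docs`, and `vectorstore` are illustrative stand-ins, not the exact names in the codebase):
```
MIN_DOCS = 3  # illustrative threshold for "enough documents"

def retrieve(query: str) -> list:
    # 1. extract keywords (for search) and domains (for metadata filtering)
    keywords, domains = query_analyzer(query)

    # 2. lexical route: metadata filter first, then BM25 over segmented tokens
    candidates = [d for d in docs if d.metadata.get("domain") in domains]
    retrieved = bm25_search(keywords, candidates)

    # 3. fall back to semantic search if lexical search comes up short
    if len(retrieved) < MIN_DOCS:
        retrieved += vectorstore.similarity_search(query, k=MIN_DOCS)

    # 4. final QA: keep only the docs the match checker deems relevant
    return [d for d in retrieved if match_checker(query, d)]
```
The surviving documents are then packed into the system prompt of the conversational RAG chain.
"""
)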
st.markdown(
    """
#### Selecting the right LLM
I compared a wide range of bi/multilingual LLMs with ~7B parameters that have a LlamaCpp-friendly GGUF checkpoint on HuggingFace (small enough to fit onto the Mac Mini's GPU).
I created conversational test cases to assess the models' instruction following, reasoning, helpfulness, coding, hallucinations, and inference speed; a simplified sketch of the comparison loop follows below.
Qwen models (Qwen 1.0 & 1.5), together with HuggingFace's zephyr-7b-beta, came out as the top 3, but the Qwen models are overly creative and do not follow few-shot examples.
Thus, the final pick is **zephyr**.
Access the complete LLM evaluation results [here](https://docs.google.com/spreadsheets/d/1OZKu6m0fHPYkbf9SBV6UUOE_flgBG7gphgyo2rzOpsU/edit?usp=sharing).
"""
)
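st.markdown(
    """
Conceptually, the comparison is just a loop over models and test cases (a simplified sketch; `load_llm`, `TEST_CASES`, and `grade` are illustrative stand-ins):
```
import time

results = {}
for model_name in ["zephyr-7b-beta", "qwen-7b", "qwen1.5-7b"]:
    llm = load_llm(model_name)  # one LlamaCpp instance per GGUF checkpoint
    scores = []
    for case in TEST_CASES:  # instruction following, reasoning, coding, ...
        start = time.time()
        answer = llm.invoke(case.prompt)
        elapsed = time.time() - start
        scores.append(grade(case, answer, elapsed))  # rubric-based score
    results[model_name] = sum(scores) / len(scores)
```
"""
)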
df_llm = pd.read_csv("resources/llm_scores.csv", index_col=0)
st.dataframe(df_llm)
st.markdown(
    """
#### Selecting the right LLM Computing Platform
I tested [Ollama](https://ollama.com/) first, given its integrated, worry-free experience that abstracts away the complexity of building environments and downloading LLMs.
However, I hit some unresponsiveness when experimenting with different LLMs, so I switched to [LlamaCpp](https://github.com/ggerganov/llama.cpp) (one layer deeper: the backend that powers Ollama).
It works great, so I stuck with it.
"""
)
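st.markdown(
    """
Running a GGUF checkpoint through the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) bindings takes only a few lines (the model path is a placeholder):
```
from llama_cpp import Llama

llm = Llama(
    model_path="models/zephyr-7b-beta.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to the M2 Pro GPU
    n_ctx=4096,
)
output = llm("Q: What is a conversational RAG? A:", max_tokens=128)
print(output["choices"][0]["text"])
```
"""
)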
st.markdown(
    """
#### Selecting the right Vector Database
Langchain supports a huge number of vector databases. Because I don't have any scalability concerns (<300 docs in total),
I prioritized ease of use, the ability to run on a local machine, support for offloading data to disk, and metadata fuzzy matching.
Redis ended up being the only option that satisfied all the criteria.
"""
)
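st.markdown(
    """
Metadata fuzzy matching with the Redis vectorstore looks roughly like this (`embeddings` is the embedding model from the earlier sketch; the `domain` field and index name are placeholders for this project's schema):
```
from langchain_community.vectorstores.redis import Redis, RedisText

vectorstore = Redis(
    redis_url="redis://localhost:6379",
    index_name="notion-docs",  # placeholder
    embedding=embeddings,
)

# fuzzy/wildcard match on the `domain` metadata field, then search within the matches
docs = vectorstore.similarity_search(
    "Who plays in the Indiana Pacers?",
    k=3,
    filter=RedisText("domain") % "article*",
)
```
"""
)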
df_vs = pd.read_csv("resources/vectordatabase_evaluation.csv", index_col=0)
st.dataframe(df_vs)
st.markdown(
    """
#### Selecting the right Embedding Model
Many companies have released their own embedding models. My search began with bi/multilingual embedding models
developed by top-tier tech companies and research labs, with sizes from 500MB to 2.2GB.
The evaluation dataset contains hand-crafted question-document pairs, where each document contains the information needed to answer the associated question.
Similar to the [**CLIP**](https://openai.com/research/clip) method, I use a "contrastive loss function" to evaluate each model, such that we maximize the gap between paired and unpaired question-doc similarities.
```
# `embedding` and `cos_sim` are the model's encoder and cosine similarity;
# a larger separation between paired and unpaired similarity is better
loss = np.abs(
    cos_sim(embedding(q), embedding(doc_paired)) -
    np.mean(cos_sim(embedding(q), embedding(doc_unpaired)))
)
```
In addition, I also considered model size and loading/inference speed for each model.
`sentence-transformers/distiluse-base-multilingual-cased-v1` turned out to be the best candidate, with top-class inference speed and the best contrastive loss; a runnable version of the evaluation follows below.
Check the evaluation notebook [here](https://github.com/fyang0507/my-notion-companion/blob/main/playground/evaluate_embedding_models.ipynb).
"""
)
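st.markdown(
    """
A runnable version of the evaluation using `sentence-transformers` directly (the toy question and document strings are illustrative):
```
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v1")

q = "Who plays in the Indiana Pacers?"  # illustrative question
doc_paired = "The Indiana Pacers' current roster includes ..."  # doc that answers it
docs_unpaired = ["Notes on sourdough baking", "A travelogue from Kyoto"]

q_emb = model.encode(q)
paired_sim = util.cos_sim(q_emb, model.encode(doc_paired)).item()
unpaired_sims = util.cos_sim(q_emb, model.encode(docs_unpaired)).numpy()

# the contrastive loss described above: larger separation is better
loss = np.abs(paired_sim - np.mean(unpaired_sims))
```
"""
)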
df_embedding = pd.read_csv("resources/embedding_model_scores.csv", index_col=0)
st.dataframe(df_embedding)
st.markdown(
    """
#### Selecting the right Observability Tool
The Langchain ecosystem comes with its own observability tool, [LangSmith](https://www.langchain.com/langsmith). It works out of the box with minimal added configuration and requires no code changes.
LLM responses are sometimes unpredictable (especially from a small 7B model with multilingual capability), and things only get more complex as the application grows into an LLM chain.
Below is a single observability trace recorded in LangSmith for a Chinese-language query: "Who plays in the Indiana Pacers? Find the answer from 'Articles'."
LangSmith organizes the LLM calls and captures the I/O along the process, making the head-scratching debugging process much less miserable. Enabling it only takes a few environment variables, as shown after the trace below.
"""
)
st.video("resources/langsmith_walkthrough.mp4")
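st.markdown(
    """
Enabling LangSmith tracing requires no code changes, only environment variables (the project name is a placeholder):
```
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "my-notion-companion"  # placeholder project name
```
"""
)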
st.markdown(
    """
#### Selecting the right UI
[Streamlit](https://docs.streamlit.io/) and [Gradio](https://www.gradio.app/docs/) are among the popular options for sharing an LLM-based application.
I chose Streamlit for its script-writing development experience and its integrated, webapp-like UI that supports multi-page app creation.
"""
)
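st.markdown(
    """
The entire chat interface boils down to a few Streamlit primitives (a minimal sketch, with `companion` standing in for the conversational RAG chain):
```
import streamlit as st

if "messages" not in st.session_state:
    st.session_state.messages = []

for msg in st.session_state.messages:  # replay the conversation so far
    st.chat_message(msg["role"]).write(msg["content"])

if prompt := st.chat_input("Ask about your Notion docs"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    st.chat_message("user").write(prompt)
    reply = companion.invoke(prompt)  # hypothetical RAG chain
    st.session_state.messages.append({"role": "assistant", "content": reply})
    st.chat_message("assistant").write(reply)
```
"""
)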
st.markdown(
    """
#### Appendix: Project Working Log and Feature Tracker
- [GitHub Homepage](https://github.com/fyang0507/my-notion-companion)
- [Working Log](https://fredyang0507.notion.site/MyNotionCompanion-ce12513756784d2ab15015582538825e?pvs=4)
- [Feature Tracker](https://fredyang0507.notion.site/306e21cfd9fa49b68f7160b2f6692f72?v=789f8ef443f44c96b7cc5f0c99a3a773&pvs=4)
"""
)