On-device RAG system through the browser

Published October 29, 2025

Hi, I'm Rasmus, a PhD student at the Technical University of Denmark and Laerdal Medical. My research is on on-device machine learning and model compression. Laerdal Medical is a world-leading provider of training, educational, and therapy products for lifesaving and emergency medical care.

In order to communicate my research to stakeholders in the company, I'd like to build small apps that showcase edge/on-device AI relevant to Laerdal Medical. For example, we could offer an offline "chat with your document" system to communities with limited internet connectivity. That would provide a personalized way to learn from educational material for those who might need it the most.

Here's a video of how it looks on my phone. For demonstration purposes I turned on airplane mode to show that it works without an internet connection:

Motivation

Finding the right context for a query is highly important for the quality of responses from modern AI systems. Information Retrieval is a classic computing task that is currently riding the deep learning wave in the shadows of Generative AI.

Representing data (images, text, audio) as vectors in high-dimensional spaces and searching semantically with similarity measures have become an increasingly common practice. A common use case is the so-called Retrieval-Augmented Generation (RAG) paradigm, wherein a user queries a generative language model and relevant context is retrieved to augment the final output.

RAG has become so common that it's often found in various cloud services as a one-click "Chat With Your Document" solution. And it's easy to understand why. Oftentimes you want your interaction with a piece of information to be personalized to your specific needs. I don't often open a textbook or documentation to read it cover to cover. I open it to retrieve the pieces of information that might help me solve the problem I'm dealing with.

Simply put, the goal is to shorten the time between problem and solution by bringing the relevant information to the problem. Large Language Models (LLMs) do a wonderful job at just that.

But what if you don't want to use an LLM hosted on a cloud service? What if your document contains private information that you don't want to upload to third parties? That's where on-device models come in. By keeping everything on your device, it can be 100% private.

You can even use it without an internet connection, which might come in handy, who knows? Setting up constraints like these creates interesting challenges. An additional constraint I set for myself was that this should run on my phone through its Chrome browser.

Transformers.js and PGlite

I have wanted to build this idea out for some time. After seeing the individual components in Xenova's Transformers.js demos, specifically the ones on SmolLM on WebGPU and in-browser semantic search with PGlite, I knew it was just a matter of stitching the two together.

The resulting app can be found as a HuggingFace Space.

Data considerations

Let's imagine you've found a document that you would like to put into the RAG system. We'd want to preprocess it a tiny bit: chunking the text helps identify units of information that can be used to answer user queries.
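As a rough illustration, a naive fixed-size chunker could look like the sketch below. This is just an assumption of what that preprocessing might involve; the chunk size, overlap, and function name are arbitrary, and as described next, the demo itself sidesteps this step by using short generated paragraphs directly.

// A minimal, hypothetical chunker: fixed-size windows with a small overlap.
// A real pipeline would more likely split on sentence or section boundaries.
function chunkText(text, chunkSize = 500, overlap = 50) {
  const chunks = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}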

As this is just a small demo to show off to coworkers and whoever else might be interested, I wanted it to be rather simple. I prompted ChatGPT to generate 5 paragraphs of text along with 5 questions for each of those paragraphs. Generating the questions lets me simplify the setup and control/understand the similarity scoring during retrieval a bit better.


Database initialization

The data had to be put into a database to be searchable. The semantic-search demo with PGlite shows that this is possible to do entirely in the browser. The interface is super nice and simple:

import { PGlite } from '@electric-sql/pglite';
import { vector } from '@electric-sql/pglite/vector';

// An in-browser Postgres instance with the pgvector extension loaded
const db = new PGlite({ extensions: { vector } });

await db.exec(`CREATE ...`);

await db.exec(`INSERT INTO ...`);

and with pgvector you can perform similarity search on columns of type vector. I precomputed vectors for the generated questions and put them into the database. With one row per question, and the associated paragraph of context duplicated for each of its questions, it was easy to get a vector database that works completely offline.
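For completeness, here's a sketch of what that schema and insertion could look like. The table and column names other than embeddings.embedding (which the search query below relies on) are my assumptions, and the vector dimensionality depends on whichever embedding model produced the precomputed vectors:

// Assumed schema: one row per generated question, with the source paragraph
// ("content") duplicated for each of its questions.
await db.exec(`
  CREATE EXTENSION IF NOT EXISTS vector;
  CREATE TABLE IF NOT EXISTS embeddings (
    id        SERIAL PRIMARY KEY,
    question  TEXT,
    content   TEXT,
    embedding VECTOR(384) -- must match the embedding model's output dimension
  );
`);

// precomputedRows: [{ question, content, embedding: number[] }, ...]
for (const { question, content, embedding } of precomputedRows) {
  await db.query(
    `INSERT INTO embeddings (question, content, embedding) VALUES ($1, $2, $3);`,
    [question, content, JSON.stringify(embedding)],
  );
}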

The search function follows pgvector's operator syntax and looks like this:

// Return rows whose stored question embedding is most similar to the query
// embedding. pgvector's <#> operator computes the *negative* inner product,
// which is why the threshold is negated and results are ordered ascending.
export const search = async (db, embedding, match_threshold = 0.6, limit = 1) => {
  const res = await db.query(
    `select * from embeddings
     where embeddings.embedding <#> $1 < $2
     order by embeddings.embedding <#> $1
     limit $3;`,
    [JSON.stringify(embedding), -Number(match_threshold), Number(limit)],
  );
  return res.rows;
};
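To call search, the user's query first has to be embedded with the same model that produced the stored vectors. Below is a minimal sketch using Transformers.js' feature-extraction pipeline; the embedding model shown is just a placeholder, and the content column comes from the assumed schema above:

import { pipeline } from "@huggingface/transformers";

// Placeholder embedding model; what matters is that queries and the stored
// question vectors come from the same model.
const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

const query = "How do I perform chest compressions?";
// Mean pooling + normalization, so the inner product used by <#> behaves
// like cosine similarity and the 0.6 threshold is meaningful.
const output = await extractor(query, { pooling: "mean", normalize: true });

const matches = await search(db, Array.from(output.data));
console.log(matches[0]?.content);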

Small Language Model

Transformers.js was crucial for throwing the app together rapidly. Having a familiar and standardized interface for getting AI inference going in the browser is incredible when you think about it. As mentioned, I started from the SmolLM on WebGPU demo, and my initial testing was with the SmolLM2-1.7B-Instruct model, as in the demo.

However, I quickly changed it to the much smaller 360M-parameter variant due to memory constraints on my phone. I'd like to revisit this and put more effort into compressing the model to fit on my phone, but for now it was all about making it work.

It was incredibly easy to experiment with different model sizes and quantization methods. Simply changing the model_id and the dtype allowed me to find just the right fit for what would run on my phone.

import { AutoTokenizer, AutoModelForCausalLM } from "@huggingface/transformers";

class TextGenerationPipeline {
  static model_id = "HuggingFaceTB/SmolLM2-360M-Instruct";

  static async getInstance(progress_callback = null) {
    // Lazily create the tokenizer and model once, then reuse them
    this.tokenizer ??= AutoTokenizer.from_pretrained(this.model_id, {
      progress_callback,
    });

    this.model ??= AutoModelForCausalLM.from_pretrained(this.model_id, {
      dtype: "q4",      // 4-bit quantized weights
      device: "webgpu", // run inference on the GPU via WebGPU
      progress_callback,
    });
...
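
With the tokenizer and model loaded, the retrieved paragraph just needs to be folded into the prompt before generating. The sketch below shows one way to do that; the prompt wording, the assumption that getInstance() resolves to the [tokenizer, model] pair, and the matches/query variables carried over from the search sketch above are mine rather than the app's exact code:

// Assumed to resolve to the tokenizer and model created by the class above
const [tokenizer, model] = await TextGenerationPipeline.getInstance();

const context = matches[0]?.content ?? "";
const messages = [
  { role: "system", content: `Answer the question using only this context:\n${context}` },
  { role: "user", content: query },
];

// Build model inputs from the chat template and generate a reply
const inputs = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});
const output_ids = await model.generate({ ...inputs, max_new_tokens: 256 });

// batch_decode returns the full sequence, prompt included
const [decoded] = tokenizer.batch_decode(output_ids, { skip_special_tokens: true });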

After changing the model the whole app worked really well! Sort of...

The language model generated answers at a whopping 4-5 tokens/second. Not a fantastic experience but an experience nonetheless.

What's next?

I had a few reasons to do this little project:

  1. To create a small demo app that showcases my research area to coworkers
  2. To get familiar with the tooling
  3. To build intuition about edge hardware

and I think it checks all the boxes.

I also think it's a great "snapshot" in time. It'll be exciting to see what I can build a year or so from now! In that time I hope to apply my upcoming research in model compression to allow for a better experience.

If you find this sort of stuff interesting please feel free to reach out on LinkedIn or Bluesky :)
