GraphRAG WASM — Browser-Native Knowledge Graph RAG
A complete GraphRAG pipeline — document ingestion, knowledge-graph build, retrieval, and LLM synthesis — running entirely in the browser via WebAssembly. No server required (an optional local Ollama backend is supported).
Quick Start
rustup target add wasm32-unknown-unknown
cargo install trunk
cd graphrag-wasm
trunk serve # dev server on http://localhost:8080
trunk build --release # production bundle in dist/
The UI: a 3-column chat shell
The interface is a single Nordic-Minimal chat shell (no tabs, no DaisyUI — a flat
hand-written stylesheet). See Chat discussion.html for the
reference mockup the layout mirrors verbatim.
| Column | Contents |
|---|---|
| LeftRail | Brand, source documents, Flat/Hierarchy toggle, Build button |
| Stage | Active source header, the thread of question/answer turns, the composer input |
| RightRail | Per-query subgraph SVG, pipeline progress rows, mini-stats, reference cards |
Answers are streamed token-by-token; inline citations ([1], [2]…) link to
reference cards in the RightRail. The per-query subgraph unions the entities from the
top-K retrieved chunks and lays them out with a built-in force-directed layout.
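
The per-query subgraph described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in chat_shell.rs: the Chunk struct, subgraph_nodes, and subgraph_edges are hypothetical names, and the edge filter (keep an edge only when both endpoints are in the union) is an assumed rule that the text does not spell out.

```rust
use std::collections::HashSet;

/// Hypothetical chunk record: the entity names extracted from one chunk.
pub struct Chunk {
    pub id: usize,
    pub entities: Vec<String>,
}

/// Union the entity names of the retrieved chunks, keeping first-seen order
/// so the node list is deterministic for a given retrieval result.
pub fn subgraph_nodes(top_k: &[Chunk]) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut nodes = Vec::new();
    for chunk in top_k {
        for entity in &chunk.entities {
            // `insert` returns false when the name was already present.
            if seen.insert(entity.as_str()) {
                nodes.push(entity.clone());
            }
        }
    }
    nodes
}

/// Assumed rule: keep only relationships whose two endpoints are both
/// part of the node set produced above.
pub fn subgraph_edges(
    nodes: &[String],
    edges: &[(String, String)],
) -> Vec<(String, String)> {
    let set: HashSet<&str> = nodes.iter().map(String::as_str).collect();
    edges
        .iter()
        .filter(|(a, b)| set.contains(a.as_str()) && set.contains(b.as_str()))
        .cloned()
        .collect()
}
```

The resulting node and edge lists would then be handed to the force-directed layout; how positions are computed is outside the scope of this sketch.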
How it works (end-to-end, in the browser)
- Document processing — chunking with configurable size/overlap.
- Entity extraction — rule-based / WebLLM-assisted extraction.
- Embeddings — ONNX Runtime Web (MiniLM-L6), run off the main thread
  (ort.env.wasm.proxy = true) so the UI never blocks during inference.
- Knowledge graph — in-memory entities, chunks, and relationships.
- Retrieval — pure-Rust cosine similarity, top-K via VectorIndex::search.
- Synthesis — WebLLM (in-browser) or Ollama (local server); citations are post-processed and wired to reference cards.
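
To make the first step of the list concrete, here is a minimal sketch of chunking with a configurable size and overlap. It is an illustration only: chunk_text and its parameters are hypothetical names, and it counts characters, whereas the actual chunker may work on tokens or sentences.

```rust
/// Split `text` into windows of at most `size` characters, where each
/// window starts `size - overlap` characters after the previous one.
/// Hypothetical helper; panics if `overlap >= size` or `size == 0`.
pub fn chunk_text(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(size > 0 && overlap < size, "need size > 0 and overlap < size");
    // Work on chars rather than bytes so multi-byte UTF-8 is never split.
    let chars: Vec<char> = text.chars().collect();
    let step = size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break; // the last window reached the end of the text
        }
        start += step;
    }
    chunks
}
```

With size 4 and overlap 1, the string "abcdefghij" yields three chunks, each sharing one character with its neighbour. A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed.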
Documents persist across reloads in IndexedDB (see src/persist.rs).
What comes from graphrag-core vs. reimplemented here
This crate is not a mock — it links graphrag-core (path dependency, wasm-safe
feature subset) and drives a real graphrag_core::GraphRAG instance: document ingestion
(add_document_from_text), the knowledge-graph types (Entity, Relationship), Leiden
community detection, and adaptive query routing all come straight from core.
The ML hot-path stages are reimplemented browser-side, because core’s native backends (Ollama HTTP, candle, the LLM extractors) do not run inside a browser:
| Stage | Source |
|---|---|
| Document ingestion, graph types, Leiden, adaptive routing | graphrag-core |
| Embeddings | wasm-side onnx_embedder.rs (ONNX Runtime Web / WebGPU, hash fallback) |
| Entity extraction | wasm-side entity_extractor.rs (WebLLM-assisted or rule-based) |
| Vector search | wasm-side vector_search.rs (pure-Rust cosine) |
Note: src/lib.rs also exposes a separate wasm_bindgen GraphRAG wrapper for direct JS use (new GraphRAG(384) + pure vector search), distinct from graphrag_core::GraphRAG despite the shared name.
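
For readers unfamiliar with the vector-search stage, the following is a minimal sketch of brute-force cosine top-K in pure Rust. It is not the contents of vector_search.rs and does not reproduce the signature of VectorIndex::search; the cosine and top_k functions and the (id, vector) tuple layout are assumptions made for illustration.

```rust
/// Cosine similarity of two vectors. Returns 0.0 for mismatched lengths
/// or zero-norm inputs instead of producing NaN.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Score every stored vector against the query and return the `k` best
/// (id, score) pairs, highest score first. Linear scan, no index structure.
pub fn top_k(
    query: &[f32],
    items: &[(String, Vec<f32>)],
    k: usize,
) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = items
        .iter()
        .map(|(id, vector)| (id.clone(), cosine(query, vector)))
        .collect();
    // total_cmp gives a total order on f32, so sorting never panics.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

A linear scan like this is typically adequate for the few thousand 384-dimensional vectors a browser session holds; an approximate index would only pay off at much larger scale.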
LLM backends: WebLLM vs Ollama
WebLLM (default) — 100% in-browser via WebGPU
import { UnifiedLlmClient } from './graphrag_wasm.js';
const llm = UnifiedLlmClient.withWebLLM("Phi-3-mini-4k-instruct-q4f16_1-MLC");
llm.setTemperature(0.7);
const answer = await llm.generate("What is GraphRAG?");
- ✅ Full privacy (no data leaves the browser), works offline after model download.
- First load downloads the model (~1–2 GB); needs a WebGPU-capable browser; small models only (1–3B).
WebLLM and ONNX inference both run in dedicated web workers
(webllm-worker.js + ORT’s proxy worker), keeping main-thread blocking under ~50 ms.
Ollama HTTP — local server, larger models
const llm = UnifiedLlmClient.withOllama("http://localhost:11434", "llama3.1:8b");
const answer = await llm.generate("What is GraphRAG?");
- ✅ 7B–70B+ models, better quality, full GPU (CUDA/Metal).
- Requires a running Ollama server + CORS:
ollama pull llama3.1:8b
OLLAMA_ORIGINS="http://localhost:8080" ollama serve
UnifiedLlmClient exposes the same generate / chat / checkAvailability API for both
backends, so switching is a one-line change.
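
The idea of one API over two backends can be sketched on the Rust side as a trait object behind a thin client. Everything here is hypothetical: LlmBackend, StubWebLlm, StubOllama, and UnifiedClient are illustrative names, the backends are stubs that only format a string, and the calls are synchronous, whereas the real UnifiedLlmClient in llm_provider.rs is exposed to JS and is awaited.

```rust
/// Hypothetical backend interface shared by both providers.
pub trait LlmBackend {
    fn generate(&self, prompt: &str) -> Result<String, String>;
    fn check_availability(&self) -> bool;
}

/// Stub standing in for the in-browser WebLLM engine.
pub struct StubWebLlm {
    pub model: String,
}

/// Stub standing in for the Ollama HTTP client.
pub struct StubOllama {
    pub endpoint: String,
    pub model: String,
}

impl LlmBackend for StubWebLlm {
    fn generate(&self, prompt: &str) -> Result<String, String> {
        // A real backend would run inference; the stub just echoes.
        Ok(format!("[webllm:{}] {}", self.model, prompt))
    }
    fn check_availability(&self) -> bool {
        true
    }
}

impl LlmBackend for StubOllama {
    fn generate(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("[ollama:{}@{}] {}", self.model, self.endpoint, prompt))
    }
    fn check_availability(&self) -> bool {
        // Assumed rule for the stub: an empty endpoint means "not reachable".
        !self.endpoint.is_empty()
    }
}

/// Thin client: callers use the same methods whichever backend is inside.
pub struct UnifiedClient {
    backend: Box<dyn LlmBackend>,
}

impl UnifiedClient {
    pub fn with_webllm(model: &str) -> Self {
        Self { backend: Box::new(StubWebLlm { model: model.to_string() }) }
    }

    pub fn with_ollama(endpoint: &str, model: &str) -> Self {
        Self {
            backend: Box::new(StubOllama {
                endpoint: endpoint.to_string(),
                model: model.to_string(),
            }),
        }
    }

    pub fn generate(&self, prompt: &str) -> Result<String, String> {
        if !self.backend.check_availability() {
            return Err("backend unavailable".to_string());
        }
        self.backend.generate(prompt)
    }
}
```

Under this design only the constructor call differs between backends, which is what makes the switch a one-line change for calling code.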
Tech stack
| Component | Technology |
|---|---|
| UI | Leptos (reactive Rust) |
| Build | Trunk |
| Styling | flat Nordic-Minimal CSS (tailwind.css, no @tailwind directives) |
| Tokenizer | HuggingFace tokenizers (unstable_wasm) |
| Embeddings | ONNX Runtime Web (off-main-thread, optional WebGPU) |
| LLM | WebLLM (in-browser) or Ollama HTTP |
| Vector search | pure Rust (cosine similarity) |
| Storage | IndexedDB |
Project layout
graphrag-wasm/
├── src/
│ ├── main.rs # chat-shell UI (LeftRail / Stage / RightRail)
│ ├── components/
│ │ ├── chat_shell.rs # data types, citation parser, subgraph builder
│ │ └── force_layout.rs # force-directed subgraph layout
│ ├── webllm.rs # WebLLM client (+ web-worker engine)
│ ├── ollama_http.rs # Ollama HTTP client
│ ├── llm_provider.rs # UnifiedLlmClient abstraction
│ ├── onnx_embedder.rs # ONNX Runtime Web embeddings
│ ├── vector_search.rs # cosine similarity
│ └── persist.rs # IndexedDB persistence
├── webllm-worker.js # WebWorker MLC engine handler
├── index.html # entry point + ORT/WebLLM worker wiring
├── tailwind.css # flat stylesheet
└── Trunk.toml # build config
Browser support
Chrome/Edge 87+, Firefox 89+, Safari 15.2+ (incl. mobile). Requires WebAssembly + ES2020 modules; WebGPU is optional (accelerates embeddings/WebLLM when present).
Tests
A Playwright parity test (tests/playwright/chat_layout.sh) asserts the WASM SPA matches
the mockup on 19 shared selectors. Unit tests:
cargo test --target wasm32-unknown-unknown
License
See the main repository LICENSE.