Why We Chose FTS5 Over Embeddings for AI Memory
When we rewrote memory-mcp from Python to TypeScript, we made a controversial decision: drop vector embeddings entirely in favor of SQLite's FTS5. The result? 46MB less bloat, instant startup, and search that actually works better for our use case.
The Numbers
- 46MB saved - No more sentence-transformers model weights
- 30+ seconds → <1s startup - No model loading
- 1,500+ tokens saved per response (no embedding bloat)
- 88 tokens for hot context retrieval (tested)
The Embeddings Trap
Vector embeddings have become the default answer for anything involving search. Need to find similar documents? Embeddings. Semantic search? Embeddings. AI memory? Obviously embeddings.
The original Python version of memory-mcp followed this playbook:
- sentence-transformers/all-MiniLM-L6-v2 model
- 384-dimension vectors
- In-memory cosine similarity using NumPy
- JSON storage with embedded vectors
- PyTorch as a dependency (yes, really)
It worked. But the costs were brutal:
- 46MB+ model weight downloaded on first run
- 30+ seconds cold start (loading the model)
- 2+ seconds latency reported by users
- Entire JSON file loaded into RAM
- No concurrent access - file locks everywhere
The ildunari Fork: Peak Complexity
Someone forked the original and tried to "fix" it by adding more infrastructure:
- Qdrant vector database
- NGINX load balancing (2 instances)
- Prometheus + Grafana monitoring
- Loki + Promtail logging
- Redis caching
- Kubernetes + Helm charts
For a personal memory tool. Running locally. With maybe 100-1,000 memories.
They learned an important lesson and documented it before archiving the project:
"After implementing and then removing the auto-capture feature, here is the correct understanding of how MCP works: Servers can only respond to requests, not initiate actions."
The fork was abandoned. Over-engineering doesn't survive contact with reality.
When Embeddings Actually Make Sense
Vector embeddings excel at specific problems:
| Use Case | Embeddings? | Why |
|---|---|---|
| Millions of documents | Yes | Can't brute force at scale |
| Cross-lingual search | Yes | Semantic meaning crosses language |
| Image/text similarity | Yes | Cross-modal requires embeddings |
| 100-1,000 memories | No | Keyword search is faster and simpler |
| Personal AI memory | No | You know what you're looking for |
| Local-first tools | No | 46MB model + startup cost kills UX |
Personal AI memory is firmly in the "No" category. You're not searching millions of documents. You're recalling dozens to hundreds of memories you created yourself.
FTS5: The Right Tool
FTS5 (Full-Text Search 5) ships inside SQLite itself, with no external dependencies. It provides:
- BM25 ranking - The same algorithm behind Elasticsearch and Lucene
- Phrase queries - Search for "authentication flow" as a phrase
- Boolean operators - AND, OR, NOT
- Prefix matching - auth* matches authentication, authorize, etc.
- Column weights - Prioritize title matches over body matches
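Here's a minimal sketch of what those features look like through better-sqlite3; the table and column names below are illustrative, not the actual memory-mcp schema:

```typescript
import Database from "better-sqlite3";

const db = new Database(":memory:");

// Illustrative FTS5 table; memory-mcp's real schema is different.
db.exec(`CREATE VIRTUAL TABLE notes USING fts5(title, body)`);

const insert = db.prepare(`INSERT INTO notes (title, body) VALUES (?, ?)`);
insert.run("Auth flow", "The authentication flow uses short-lived tokens.");
insert.run("DB notes", "SQLite FTS5 handles full-text search locally.");

// Phrase query, boolean operator, and prefix matching in one MATCH expression.
// bm25() weights rank title hits higher than body hits (10.0 vs 1.0).
const rows = db
  .prepare(
    `SELECT title, bm25(notes, 10.0, 1.0) AS score
       FROM notes
      WHERE notes MATCH '"authentication flow" OR auth*'
      ORDER BY score`
  )
  .all();

console.log(rows); // lower (more negative) bm25() score = better match
```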
For memory-mcp, we built a hybrid scoring system:
`score = 0.4 * relevance + 0.3 * importance + 0.2 * recency + 0.1 * frequency`

This means a highly relevant but older memory can still rank above a recent but tangentially related one. The weights are tunable, but these defaults work well.
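Here's a sketch of how that blend can be computed; the normalization constants are assumptions for illustration, not memory-mcp's exact values:

```typescript
interface ScoredMemory {
  bm25: number;        // raw bm25() value from FTS5 (more negative = more relevant)
  importance: number;  // stored 0..1 at capture time
  createdAt: number;   // unix epoch seconds
  accessCount: number;
}

// Sketch of the hybrid score; the exact normalization in memory-mcp may differ.
function hybridScore(m: ScoredMemory, now = Date.now() / 1000): number {
  const relevance = Math.min(1, -m.bm25 / 10);        // squash bm25 into 0..1
  const ageDays = (now - m.createdAt) / 86_400;
  const recency = Math.exp(-ageDays / 30);            // decay over roughly a month
  const frequency = Math.min(1, m.accessCount / 10);  // cap frequently accessed memories at 1
  return 0.4 * relevance + 0.3 * m.importance + 0.2 * recency + 0.1 * frequency;
}
```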
The Token Budget Problem
Here's something embedding-based systems get wrong: they ignore token cost.
When Claude calls memory_recall, we need to return memories that fit within context limits. The old Python version would return:
- Memory content
- 384-dimension embedding vector (stringified)
- Full metadata
- Similarity scores
Result: 1,500+ tokens per response in some cases. Most of it useless to Claude.
The new version uses a 3-tier response system:
| Tier | Tokens | Content |
|---|---|---|
| Minimal | ~30 | Just the summary |
| Standard | ~200 | Summary + key context |
| Full | ~500 | Everything including metadata |
Hot context (the most relevant memories) tested at just 88 tokens. That's 17x more efficient than the embedding-bloated responses.
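A sketch of the tier shaping, with illustrative field names rather than memory-mcp's exact response shape:

```typescript
type Tier = "minimal" | "standard" | "full";

interface Memory {
  summary: string;
  content: string;
  tags: string[];
  importance: number;
  createdAt: string;
}

// Sketch: trim the response to fit the requested token budget.
function shapeForTier(m: Memory, tier: Tier): object {
  if (tier === "minimal") {
    // ~30 tokens: just the summary
    return { summary: m.summary };
  }
  if (tier === "standard") {
    // ~200 tokens: summary plus key context, no metadata
    return { summary: m.summary, content: m.content.slice(0, 600), tags: m.tags };
  }
  // "full": ~500 tokens, everything including metadata
  return m;
}
```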
The Startup Cost Nobody Talks About
MCP servers need to start fast. Every time you restart Claude Desktop, every MCP server initializes. With the old Python version:
- Python interpreter starts (~500ms)
- Import sentence-transformers (~2s)
- Load the model into memory (~10-30s first time, ~5s cached)
- Finally ready to serve requests
With the TypeScript + FTS5 version:
- Node starts (~100ms)
- Open SQLite database (~10ms)
- Ready
Sub-second startup. No model downloading. No waiting.
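The entire startup path amounts to opening a file. A sketch, assuming better-sqlite3 (before the planned Bun migration):

```typescript
import Database from "better-sqlite3";

// Startup is just opening the database file; there is no model to download or load.
const db = new Database("memories.db");
db.pragma("journal_mode = WAL"); // WAL allows concurrent readers, unlike the old JSON file with locks
// Ready to serve requests.
```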
What We Lost
To be fair, dropping embeddings does sacrifice some capabilities:
- Semantic similarity - "car" won't match "automobile" unless you explicitly store both
- Typo tolerance - "authenication" won't find "authentication"
- Cross-lingual - Can't search English memories with French queries
For personal AI memory, these tradeoffs are acceptable. You wrote the memories. You know roughly what words you used. And if you need semantic search at scale, use a dedicated solution like Pinecone or Qdrant.
The Architecture That Shipped
Here's what the final memory-mcp architecture looks like:
```
SQLite Database
├── memories (main table)
│   ├── id, content, summary
│   ├── importance, created_at
│   ├── access_count, last_accessed
│   └── tags (JSON array)
├── memories_fts (FTS5 virtual table)
│   └── Indexed: content, summary, tags
└── Hybrid scoring query
    └── BM25 + importance + recency + frequency
```

Three tools. One database file. Zero external dependencies beyond better-sqlite3 (and we're migrating to Bun's built-in SQLite to eliminate even that).
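In SQL terms, that layout can be created roughly like this; a sketch where the exact column definitions and sync triggers are assumptions based on the outline above, not the literal memory-mcp migration:

```typescript
import Database from "better-sqlite3";

const db = new Database("memories.db");

// Sketch of the schema outlined above; memory-mcp's actual DDL may differ.
db.exec(`
  CREATE TABLE IF NOT EXISTS memories (
    id            INTEGER PRIMARY KEY,
    content       TEXT NOT NULL,
    summary       TEXT NOT NULL,
    importance    REAL DEFAULT 0.5,
    created_at    INTEGER DEFAULT (strftime('%s','now')),
    access_count  INTEGER DEFAULT 0,
    last_accessed INTEGER,
    tags          TEXT DEFAULT '[]'   -- JSON array stored as text
  );

  -- External-content FTS5 table indexing content, summary, and tags.
  CREATE VIRTUAL TABLE IF NOT EXISTS memories_fts
    USING fts5(content, summary, tags, content='memories', content_rowid='id');

  -- Keep the FTS index in sync on insert (delete/update triggers omitted for brevity).
  CREATE TRIGGER IF NOT EXISTS memories_ai AFTER INSERT ON memories BEGIN
    INSERT INTO memories_fts(rowid, content, summary, tags)
    VALUES (new.id, new.content, new.summary, new.tags);
  END;
`);
```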
When to Use What
Here's the decision tree we use:
Dataset size < 10K documents?
→ Use FTS5. It's simpler and faster.
Need semantic/cross-lingual search?
→ Use embeddings, but via an external service (Pinecone, Qdrant).
Local-first with no external deps?
→ FTS5 is the only sane choice.
Conclusion
The industry's default answer to search is "add embeddings." For large-scale semantic search, that's right. For personal AI memory with 100-1,000 items, it's over-engineering.
FTS5 gave us:
- 46MB less bloat
- 30x faster startup
- 17x more token-efficient responses
- Zero external dependencies
- Search that actually works for the use case
Sometimes simpler wins. This was one of those times.
Try memory-mcp
Persistent memory for Claude. FTS5-powered. Install in seconds.
`npx @whenmoon-afk/memory-mcp`