Why We Chose FTS5 Over Embeddings for AI Memory
When we rewrote memory-mcp from Python to TypeScript, we made a controversial decision: drop vector embeddings entirely in favor of SQLite's FTS5. The result? 46MB less bloat, instant startup, and search that actually works better for our use case.
The Numbers
- 46MB saved - No more sentence-transformers model weights
- 30+ seconds → <1s startup - No model loading
- 1,500+ tokens saved per response (no embedding bloat)
- 88 tokens for hot context retrieval (tested)
The Embeddings Trap
Vector embeddings have become the default answer for anything involving search. Need to find similar documents? Embeddings. Semantic search? Embeddings. AI memory? Obviously embeddings.
The original Python version of memory-mcp followed this playbook:
- sentence-transformers/all-MiniLM-L6-v2 model
- 384-dimension vectors
- In-memory cosine similarity using NumPy
- JSON storage with embedded vectors
- PyTorch as a dependency (yes, really)
It worked. But the costs were brutal:
- 46MB+ model weight downloaded on first run
- 30+ seconds cold start (loading the model)
- 2+ seconds latency reported by users
- Entire JSON file loaded into RAM
- No concurrent access - file locks everywhere
The ildunari Fork: Peak Complexity
Someone forked the original and tried to "fix" it by adding more infrastructure:
- Qdrant vector database
- NGINX load balancing (2 instances)
- Prometheus + Grafana monitoring
- Loki + Promtail logging
- Redis caching
- Kubernetes + Helm charts
For a personal memory tool. Running locally. With maybe 100-1,000 memories.
They learned an important lesson and documented it before archiving the project:
"After implementing and then removing the auto-capture feature, here is the correct understanding of how MCP works: Servers can only respond to requests, not initiate actions."
The fork was abandoned. Over-engineering doesn't survive contact with reality.
When Embeddings Actually Make Sense
Vector embeddings excel at specific problems:
| Use Case | Embeddings? | Why |
|---|---|---|
| Millions of documents | Yes | Can't brute force at scale |
| Cross-lingual search | Yes | Semantic meaning crosses language |
| Image/text similarity | Yes | Cross-modal requires embeddings |
| 100-1,000 memories | No | Keyword search is faster and simpler |
| Personal AI memory | No | You know what you're looking for |
| Local-first tools | No | 46MB model + startup cost kills UX |
Personal AI memory is firmly in the "No" category. You're not searching millions of documents. You're recalling dozens to hundreds of memories you created yourself.
FTS5: The Right Tool
FTS5 (Full-Text Search 5) ships inside SQLite itself, with no external dependencies. It provides:
- BM25 ranking - The same algorithm behind Elasticsearch and Lucene
- Phrase queries - Search for "authentication flow" as a phrase
- Boolean operators - AND, OR, NOT
- Prefix matching - auth* matches authentication, authorize, etc.
- Column weights - Prioritize title matches over body matches
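Here's a minimal sketch of what those features look like through better-sqlite3; the table and column names below are illustrative, not the actual memory-mcp schema:

```typescript
import Database from "better-sqlite3";

const db = new Database(":memory:");

// Illustrative FTS5 table; memory-mcp's real schema is different.
db.exec(`CREATE VIRTUAL TABLE notes USING fts5(title, body)`);

const insert = db.prepare(`INSERT INTO notes (title, body) VALUES (?, ?)`);
insert.run("Auth flow", "The authentication flow uses short-lived tokens.");
insert.run("DB notes", "SQLite FTS5 handles full-text search locally.");

// Phrase query, boolean operator, and prefix matching in one MATCH expression.
// bm25() weights rank title hits higher than body hits (10.0 vs 1.0).
const rows = db
  .prepare(
    `SELECT title, bm25(notes, 10.0, 1.0) AS score
       FROM notes
      WHERE notes MATCH '"authentication flow" OR auth*'
      ORDER BY score`
  )
  .all();

console.log(rows); // lower (more negative) bm25() score = better match
```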
For memory-mcp, we built a hybrid scoring system:
`score = 0.4 * relevance + 0.3 * importance + 0.2 * recency + 0.1 * frequency`

This means a highly relevant but older memory can still rank above a recent but tangentially related one. The weights are tunable, but these defaults work well.
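Here's a sketch of how that blend can be computed; the normalization constants are assumptions for illustration, not memory-mcp's exact values:

```typescript
interface ScoredMemory {
  bm25: number;        // raw bm25() value from FTS5 (more negative = more relevant)
  importance: number;  // stored 0..1 at capture time
  createdAt: number;   // unix epoch seconds
  accessCount: number;
}

// Sketch of the hybrid score; the exact normalization in memory-mcp may differ.
function hybridScore(m: ScoredMemory, now = Date.now() / 1000): number {
  const relevance = Math.min(1, -m.bm25 / 10);        // squash bm25 into 0..1
  const ageDays = (now - m.createdAt) / 86_400;
  const recency = Math.exp(-ageDays / 30);            // decay over roughly a month
  const frequency = Math.min(1, m.accessCount / 10);  // cap frequently accessed memories at 1
  return 0.4 * relevance + 0.3 * m.importance + 0.2 * recency + 0.1 * frequency;
}
```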
The Token Budget Problem
Here's something embedding-based systems get wrong: they ignore token cost.
When Claude calls memory_recall, we need to return memories that fit within context limits. The old Python version would return:
- Memory content
- 384-dimension embedding vector (stringified)
- Full metadata
- Similarity scores
Result: 1,500+ tokens per response in some cases. Most of it useless to Claude.
The new version uses a 3-tier response system:
| Tier | Tokens | Content |
|---|---|---|
| Minimal | ~30 | Just the summary |
| Standard | ~200 | Summary + key context |
| Full | ~500 | Everything including metadata |
Hot context (the most relevant memories) tested at just 88 tokens. That's 17x more efficient than the embedding-bloated responses.
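A sketch of the tier shaping, with illustrative field names rather than memory-mcp's exact response shape:

```typescript
type Tier = "minimal" | "standard" | "full";

interface Memory {
  summary: string;
  content: string;
  tags: string[];
  importance: number;
  createdAt: string;
}

// Sketch: trim the response to fit the requested token budget.
function shapeForTier(m: Memory, tier: Tier): object {
  if (tier === "minimal") {
    // ~30 tokens: just the summary
    return { summary: m.summary };
  }
  if (tier === "standard") {
    // ~200 tokens: summary plus key context, no metadata
    return { summary: m.summary, content: m.content.slice(0, 600), tags: m.tags };
  }
  // "full": ~500 tokens, everything including metadata
  return m;
}
```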
The Startup Cost Nobody Talks About
MCP servers need to start fast. Every time you restart Claude Desktop, every MCP server initializes. With the old Python version:
- Python interpreter starts (~500ms)
- Import sentence-transformers (~2s)
- Load the model into memory (~10-30s first time, ~5s cached)
- Finally ready to serve requests
With the TypeScript + FTS5 version:
- Node starts (~100ms)
- Open SQLite database (~10ms)
- Ready
Sub-second startup. No model downloading. No waiting.
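The entire startup path amounts to opening a file. A sketch, assuming better-sqlite3 (before the planned Bun migration):

```typescript
import Database from "better-sqlite3";

// Startup is just opening the database file; there is no model to download or load.
const db = new Database("memories.db");
db.pragma("journal_mode = WAL"); // WAL allows concurrent readers, unlike the old JSON file with locks
// Ready to serve requests.
```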
What We Lost
To be fair, dropping embeddings does sacrifice some capabilities:
- Semantic similarity - "car" won't match "automobile" unless you explicitly store both
- Typo tolerance - "authenication" won't find "authentication"
- Cross-lingual - Can't search English memories with French queries
For personal AI memory, these tradeoffs are acceptable. You wrote the memories. You know roughly what words you used. And if you need semantic search at scale, use a dedicated solution like Pinecone or Qdrant.
The Architecture That Shipped
Here's what the final memory-mcp architecture looks like:
```
SQLite Database
├── memories (main table)
│   ├── id, content, summary
│   ├── importance, created_at
│   ├── access_count, last_accessed
│   └── tags (JSON array)
├── memories_fts (FTS5 virtual table)
│   └── Indexed: content, summary, tags
└── Hybrid scoring query
    └── BM25 + importance + recency + frequency
```

Three tools. One database file. Zero external dependencies beyond better-sqlite3 (and we're migrating to Bun's built-in SQLite to eliminate even that).
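In SQL terms, that layout can be created roughly like this; a sketch where the exact column definitions and sync triggers are assumptions based on the outline above, not the literal memory-mcp migration:

```typescript
import Database from "better-sqlite3";

const db = new Database("memories.db");

// Sketch of the schema outlined above; memory-mcp's actual DDL may differ.
db.exec(`
  CREATE TABLE IF NOT EXISTS memories (
    id            INTEGER PRIMARY KEY,
    content       TEXT NOT NULL,
    summary       TEXT NOT NULL,
    importance    REAL DEFAULT 0.5,
    created_at    INTEGER DEFAULT (strftime('%s','now')),
    access_count  INTEGER DEFAULT 0,
    last_accessed INTEGER,
    tags          TEXT DEFAULT '[]'   -- JSON array stored as text
  );

  -- External-content FTS5 table indexing content, summary, and tags.
  CREATE VIRTUAL TABLE IF NOT EXISTS memories_fts
    USING fts5(content, summary, tags, content='memories', content_rowid='id');

  -- Keep the FTS index in sync on insert (delete/update triggers omitted for brevity).
  CREATE TRIGGER IF NOT EXISTS memories_ai AFTER INSERT ON memories BEGIN
    INSERT INTO memories_fts(rowid, content, summary, tags)
    VALUES (new.id, new.content, new.summary, new.tags);
  END;
`);
```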
When to Use What
Here's the decision tree we use:
Dataset size < 10K documents?
→ Use FTS5. It's simpler and faster.
Need semantic/cross-lingual search?
→ Use embeddings, but via an external service (Pinecone, Qdrant).
Local-first with no external deps?
→ FTS5 is the only sane choice.
Conclusion
The industry's default answer to search is "add embeddings." For large-scale semantic search, that's right. For personal AI memory with 100-1,000 items, it's over-engineering.
FTS5 gave us:
- 46MB less bloat
- 30x faster startup
- 17x more token-efficient responses
- Zero external dependencies
- Search that actually works for the use case
Sometimes simpler wins. This was one of those times.
Try memory-mcp
Persistent memory for Claude. FTS5-powered. Install in seconds.
`npx @whenmoon-afk/memory-mcp`