Feather DB BM25 + LongMemEval: New Retrieval Benchmarks
Feather DB ships a two-mode LongMemEval benchmark harness — a dependency-free retrieval harness and an end-to-end QA-accuracy harness. BM25 baseline hits recall@1=0.874 on 500 questions with no API key. End-to-end QA accuracy: 0.693 vs GPT-4o full-context 0.640, at 38× lower cost.
What shipped
Feather DB's benchmark suite now includes a full LongMemEval harness with two independent modes. You can measure retrieval quality in isolation — no LLM, no API key, runs in 11 seconds — or run the complete end-to-end pipeline from ingestion through answer judgment. Both modes operate on the same 500-question LongMemEval_S dataset.
The harness previously required API credentials even to test basic retrieval, which blocked contributors and anyone who wanted to profile the index layer without spending on inference. That blocker is gone.
Two harness modes
Retrieval harness (no API key required)
The retrieval harness ingests the LongMemEval conversation histories, runs a retrieval pass for each of the 500 questions, and reports recall@k at k=1, 3, 5, and 10. No embedding API, no LLM judge — just the Feather index and the query.
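As a rough illustration of the metric, recall@k can be read as the fraction of questions whose gold session shows up among the top k retrieved results. This sketch is not the harness's own code: the function name `recall_at_k`, the data layout (one ranked list of session ids per question, one set of gold ids per question), and the "any gold id counts as a hit" rule are assumptions made for the example.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[Set[str]], k: int) -> float:
    """Fraction of questions with at least one gold id in the top-k results."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have one entry per question")
    if not ranked:
        return 0.0
    hits = sum(1 for r, g in zip(ranked, gold) if any(doc in g for doc in r[:k]))
    return hits / len(ranked)


# Toy data: three questions with hypothetical session ids.
ranked = [["s3", "s1", "s7"], ["s2", "s9", "s4"], ["s5", "s6", "s8"]]
gold = [{"s3"}, {"s4"}, {"s0"}]

for k in (1, 3):
    print(f"recall@{k} = {recall_at_k(ranked, gold, k):.3f}")
```

On the toy data, one of three questions is answered at rank 1 and two of three within the top 3, so the metric can only stay flat or rise as k grows.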
BM25 baseline on the full 500-question set, run locally:
python -m bench run longmemeval --dataset s --mode retrieval --retriever bm25
Results:
| Metric | BM25 Baseline |
|---|---|
| recall@1 | 0.874 |
| recall@3 | 0.942 |
| recall@5 | 0.974 |
| recall@10 | 0.986 |
| Wall time (500 questions) | 11 s |
| API key required | None |
The retrieval harness is the right tool for iterating on the index layer. When you change tokenization, BM25 parameters, or hybrid fusion weights, you want a signal in seconds — not a full LLM judge run that costs money and takes hours.
QA-accuracy harness (end-to-end metric)
The QA-accuracy harness runs the full pipeline: ingest conversation history → retrieve relevant memories at question time → pass retrieved context to an LLM answerer → have a separate LLM judge compare the answer against the gold label. The output is a single accuracy score over all 500 questions.
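The shape of that pipeline can be sketched with three pluggable callables. Everything here is hypothetical scaffolding for illustration (the `qa_accuracy` function, the `question`/`gold` keys, the stand-in retriever, answerer, and judge); the real harness's internals are not shown in this post.

```python
from typing import Callable, Dict, List, Sequence


def qa_accuracy(
    questions: Sequence[Dict[str, str]],
    retrieve: Callable[[str, int], List[str]],
    answer: Callable[[str, List[str]], str],
    judge: Callable[[str, str, str], bool],
    k: int = 10,
) -> float:
    """Run retrieve -> answer -> judge per question and return mean accuracy."""
    if not questions:
        return 0.0
    correct = 0
    for q in questions:
        context = retrieve(q["question"], k)            # top-k memories
        predicted = answer(q["question"], context)      # LLM answerer
        if judge(q["question"], predicted, q["gold"]):  # separate LLM judge
            correct += 1
    return correct / len(questions)


# Stand-ins so the sketch runs without any API: a fixed memory store,
# an "answerer" that echoes the first snippet, and an exact-match "judge".
memories = {"lang": "Python", "editor": "Vim"}
questions = [
    {"question": "lang", "gold": "Python"},
    {"question": "editor", "gold": "Emacs"},
]
score = qa_accuracy(
    questions,
    retrieve=lambda q, k: [memories[q]][:k],
    answer=lambda q, ctx: ctx[0] if ctx else "",
    judge=lambda q, pred, gold: pred == gold,
)
print(score)
```

Keeping the three stages as separate callables is what makes it possible to swap the answerer model while holding retrieval fixed.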
This is the right metric for production decisions. Recall@k can look excellent while the downstream answer quality is broken — for example, if retrieved snippets are relevant but too short to contain the specific fact the LLM needs. QA accuracy catches that. It is what the LongMemEval paper uses as its primary measure.
python -m bench run longmemeval --dataset s --mode qa \
  --embedder openai \
  --judge llm --judge-provider gemini --judge-model gemini-2.0-flash \
  --answerer-provider gemini --answerer-model gemini-2.5-flash \
  --decay-half-life 14 --decay-time-weight 0.4 --k 10
QA accuracy results
| Configuration | LongMemEval_S Score | Cost / full run |
|---|---|---|
| Feather DB + GPT-4o | 0.693 | ~$7.50 |
| Feather DB + Gemini 2.5-flash | 0.657 | ~$2.40 |
| Full-context GPT-4o (paper ceiling) | 0.640 | ~$288 / 1K questions |
| Full-context GPT-4o-mini | 0.554 | (paper-reported) |
| Naive vector RAG (Stella + GPT-4o) | ~0.310 | (paper-reported) |
Feather DB + GPT-4o scores 0.693 against the paper's full-context GPT-4o ceiling of 0.640, a 5.3 percentage-point improvement. The 38× cost figure is the ratio of the two table entries: ~$288 per 1K questions for full-context GPT-4o against ~$7.50 for a full Feather DB run. The pipeline retrieves 10 snippets per question rather than stuffing 115K tokens of conversation history into the model's context window: less noise, lower cost, higher accuracy.
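The arithmetic behind these figures can be rechecked in a few lines. The numbers are copied from the table. The only assumption is that a "full run" means the 500-question LongMemEval_S set, in which case the two cost entries are on different bases, so the sketch prints the ratio both as listed and with both sides normalized to 1K questions.

```python
# Figures copied from the results table.
feather_gpt4o_acc = 0.693
full_context_acc = 0.640
feather_cost_per_run = 7.50        # USD, "Cost / full run" column
full_context_cost_per_1k = 288.0   # USD per 1K questions

delta_pp = (feather_gpt4o_acc - full_context_acc) * 100
ratio_as_listed = full_context_cost_per_1k / feather_cost_per_run

# Assumption: a "full run" is the 500-question LongMemEval_S set,
# so the per-1K cost is twice the per-run cost.
questions_per_run = 500
feather_cost_per_1k = feather_cost_per_run * 1000 / questions_per_run
ratio_per_1k = full_context_cost_per_1k / feather_cost_per_1k

print(f"accuracy delta: {delta_pp:+.1f} pp")
print(f"cost ratio, figures as listed: {ratio_as_listed:.1f}x")
print(f"cost ratio, both per 1K questions: {ratio_per_1k:.1f}x")
```

Under that assumption the like-for-like ratio comes out nearer 19× than 38×; either way the retrieval pipeline is more than an order of magnitude cheaper.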
Why BM25 beats naive embedding retrieval at recall
The BM25 baseline reaching recall@1=0.874 surprises engineers who assume dense vectors dominate retrieval tasks. The explanation is structural.
LongMemEval questions are heavily anchored on specific facts: a user's stated preference, a date, a proper noun, a version number. Dense embedding models average meaning across a sequence. Two memories that carry different specific facts — "User set up the project on March 3rd" and "User set up the project on April 17th" — embed to nearly identical vectors because the semantic content is almost identical. BM25 term frequency scoring distinguishes them because the date tokens are rare within the corpus and carry high IDF weight.
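A minimal Okapi BM25 scorer makes the effect concrete. This is a from-scratch sketch, not Feather DB's implementation: the tokenizer, the `k1`/`b` defaults, and the smoothed IDF form are common textbook choices assumed for the example.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Lowercase and keep alphanumeric runs only (assumed tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms (low df) get a large IDF; terms in every doc get a small one.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "User set up the project on March 3rd",
    "User set up the project on April 17th",
]
scores = bm25_scores("When did I set up the project in March?", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

The shared words ("set", "up", "the", "project") contribute the same small amount to both memories; the single token "march", which appears in only one of them, decides the ranking.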
The pattern generalizes across four categories of query where BM25 outperforms naive dense retrieval:
| Query type | Example | Why dense underperforms |
|---|---|---|
| Temporal anchors | "What did I say on March 3rd?" | Date tokens collapse in embedding space |
| Proper nouns | "Alice's preference for Python" | Name tokens averaged into semantic neighbors |
| Version / ID strings | "v0.15.1 changelog" | Version strings have near-zero semantic distinction |
| Rare technical terms | "HNSW ef_construction parameter" | Rare tokens diluted by surrounding context |
In LongMemEval specifically, a large fraction of questions fall into the first two categories. This is why the BM25 baseline outperforms published naive RAG results by more than 2× at recall@1 (0.874 vs ~0.40 for standard dense retrieval on this dataset).
Feather DB's hybrid search uses Reciprocal Rank Fusion to combine BM25 and dense ANN results — taking the best of both. The BM25 index is stored inside the .feather file alongside the HNSW graph, updated incrementally on every add() call, with no separate indexing step.
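Reciprocal Rank Fusion itself is small enough to sketch. The version below is the generic formulation, in which each list contributes 1 / (k + rank) per document; the constant `k = 60` is the value commonly used in the literature and is an assumption here, as are the function name and the memory ids. Feather DB's actual constants and weighting are not stated in this post.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked id lists; each hit contributes 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["m1", "m4", "m2"]    # hypothetical lexical ranking
dense_hits = ["m2", "m1", "m3"]   # hypothetical dense ANN ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Because only ranks are used, the BM25 and cosine scores never need to be put on a common scale, and a memory that both retrievers rank highly rises above one that only a single retriever found.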
New dependency-free chat client
The QA-accuracy harness requires an LLM for both the answerer and the judge. Previously it pulled in openai and google-generativeai as hard dependencies, which created a mismatch: the retrieval harness ran with zero dependencies, but the QA harness needed a full SDK install.
The new chat client uses only Python's standard library urllib — the same pattern as embedders.py — to talk to provider APIs directly. It supports four providers with a single consistent interface:
from bench.chat import chat_completion
# Works with any supported provider — no SDK installed
response = chat_completion(
    provider="gemini",  # or "openai", "anthropic", "ollama"
    model="gemini-2.5-flash",
    messages=[
        {"role": "user", "content": "What is the user's preferred programming language?"}
    ],
)
Provider routing is determined at call time by the provider argument. Environment variable requirements per provider:
| Provider | Env var | Notes |
|---|---|---|
| gemini | GOOGLE_API_KEY | Supports Gemini 2.5-flash, 2.0-flash, Pro |
| openai | OPENAI_API_KEY | Supports GPT-4o, GPT-4o-mini, o3 |
| anthropic | ANTHROPIC_API_KEY | Supports Claude Sonnet, Haiku, Opus |
| ollama | none | Local inference, no API key |
The Ollama path means you can run the full QA-accuracy harness, from retrieval through the LLM judge, with zero external API costs, provided you have a local model running. The retrieval harness with BM25 already runs with no API key at all. Together, these make the full benchmark suite runnable offline once the dataset has been downloaded.
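To show what a stdlib-only call can look like, here is a sketch of a chat request to a local Ollama server using nothing but `urllib` and `json`. It is not the code in bench.chat: the helper names `build_ollama_request` and `send`, the default base URL, and the model name are assumptions for the example. The endpoint and payload follow Ollama's documented `/api/chat` format.

```python
import json
import urllib.request
from typing import Dict, List


def build_ollama_request(
    model: str,
    messages: List[Dict[str, str]],
    base_url: str = "http://localhost:11434",
) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/chat endpoint (not sent here)."""
    payload = {"model": model, "messages": messages, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["message"]["content"]


req = build_ollama_request(
    model="llama3",  # assumption: any model already pulled locally
    messages=[{"role": "user", "content": "What is the user's preferred programming language?"}],
)
print(req.get_method(), req.full_url)
# Calling send(req) requires a running Ollama server.
```

Separating request construction from sending keeps the network call in one place, which makes the request easy to inspect or test without a server.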
Choosing which harness to run
The two modes serve different purposes in a development workflow:
| Scenario | Use | Why |
|---|---|---|
| Tuning index parameters (k1, b, ef) | Retrieval harness | 11s per run, no API cost |
| Evaluating a new embedder | Retrieval harness | Isolates embedder quality from LLM answerer |
| Pre-release accuracy check | QA-accuracy harness | End-to-end signal; recall@k alone doesn't guarantee answer quality |
| Comparing answerer models | QA-accuracy harness | Same retrieval, different LLM — clean A/B |
| CI smoke test (no API key) | Retrieval harness + BM25 | Zero dependencies, deterministic, 11s |
The recommended workflow for a parameter change: run the retrieval harness to confirm recall doesn't regress, then run the QA harness on a 100-question subset before committing to the full 500-question run.
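For the subset step, it helps if the same 100 questions are used on every run so that scores stay comparable between parameter changes. One way to do that is seeded sampling, sketched below. The function `fixed_subset` and the question ids are hypothetical; this post does not describe how the harness itself selects a subset.

```python
import random
from typing import List, Sequence


def fixed_subset(question_ids: Sequence[str], size: int = 100, seed: int = 0) -> List[str]:
    """Pick the same `size` questions on every run by seeding the sampler."""
    if size >= len(question_ids):
        return list(question_ids)
    rng = random.Random(seed)
    return sorted(rng.sample(list(question_ids), size))


all_ids = [f"q{i:03d}" for i in range(500)]  # hypothetical question ids
subset = fixed_subset(all_ids)
print(len(subset), subset == fixed_subset(all_ids))
```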
Reproducing the BM25 baseline
pip install feather-db
git clone https://github.com/feather-store/feather && cd feather
# No API key needed — BM25 retrieval only
python -m bench run longmemeval --dataset s --mode retrieval --retriever bm25
# Results land in bench/results/ as JSON
# Summary table in bench/reports/latest.md
The dataset downloads automatically on first run. The full 500-question BM25 retrieval run completes in 11 seconds on a standard laptop. Per-question results are written as JSON — every recall number in this post is auditable by re-running the command.
What's measured by LongMemEval
LongMemEval (Wu et al., 2024; ICLR 2025) evaluates long-term memory in chat assistants across five axes. Each of the 500 questions is paired with a conversation history of approximately 115K tokens spread across roughly 40 sessions, most of which are distractors. The five axes test qualitatively different memory abilities:
- information-extraction: recall a fact the user or assistant stated at some point in the history
- multi-session reasoning: synthesize facts that were stated in separate, non-adjacent sessions
- temporal reasoning: answer time-anchored questions ("what did I say three weeks ago?")
- knowledge-updates: track changes or contradictions over time (the user changed their mind)
- abstention: correctly refuse to answer when the required information isn't in the history
The retrieval harness measures whether the relevant session was returned in the top-k results. The QA harness measures whether the LLM produced a correct final answer given what was retrieved. A question whose evidence was retrieved correctly but whose answer the LLM got wrong still scores zero, which is why recall@k and QA accuracy are both necessary and neither is sufficient alone.
Resources
- GitHub: github.com/feather-store/feather
- Install: pip install feather-db
- Benchmark harness: bench/
- Per-run JSON results: bench/results/
- LongMemEval paper: arxiv.org/abs/2410.10813
- Hybrid BM25 + dense search: How hybrid search works in Feather DB
Found a number that doesn't reproduce? Open an issue with your result JSON — we mean it about the audit trail.
Feather DB is part of Hawky.ai — AI-native development tools.