Performance · 8 min read · June 20, 2026

Feather DB BM25 + LongMemEval: New Retrieval Benchmarks

Feather DB ships a two-mode LongMemEval benchmark harness — a dependency-free retrieval harness and an end-to-end QA-accuracy harness. BM25 baseline hits recall@1=0.874 on 500 questions with no API key. End-to-end QA accuracy: 0.693 vs GPT-4o full-context 0.640, at 38× lower cost.

Feather DB
Engineering

What shipped

Feather DB's benchmark suite now includes a full LongMemEval harness with two independent modes. You can measure retrieval quality in isolation — no LLM, no API key, runs in 11 seconds — or run the complete end-to-end pipeline from ingestion through answer judgment. Both modes operate on the same 500-question LongMemEval_S dataset.

The harness previously required API credentials even to test basic retrieval, which blocked contributors and anyone who wanted to profile the index layer without spending on inference. That blocker is gone.

Two harness modes

Retrieval harness (no API key required)

The retrieval harness ingests the LongMemEval conversation histories, runs a retrieval pass for each of the 500 questions, and reports recall@k at k=1, 3, 5, and 10. No embedding API, no LLM judge — just the Feather index and the query.

BM25 baseline on the full 500-question set, run locally:

python -m bench run longmemeval --dataset s --mode retrieval --retriever bm25

Results:

| Metric | BM25 baseline |
|---|---|
| recall@1 | 0.874 |
| recall@3 | 0.942 |
| recall@5 | 0.974 |
| recall@10 | 0.986 |
| Wall time (500 questions) | 11 s |
| API key required | None |

The retrieval harness is the right tool for iterating on the index layer. When you change tokenization, BM25 parameters, or hybrid fusion weights, you want a signal in seconds — not a full LLM judge run that costs money and takes hours.
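To make the metric concrete, here is a minimal sketch of how recall@k could be computed from ranked results. It assumes a hit-based definition (a question counts as recalled if any gold session appears in the top k), and the `ranked` and `gold` field names and session IDs are hypothetical. The harness's own result schema may differ.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Return 1.0 if any gold item appears in the top-k results, else 0.0."""
    return 1.0 if any(doc_id in gold_ids for doc_id in ranked_ids[:k]) else 0.0


def mean_recall(runs: List[Dict], ks: Sequence[int] = (1, 3, 5, 10)) -> Dict[int, float]:
    """Average the per-question hit indicator for each cutoff k."""
    return {
        k: sum(recall_at_k(r["ranked"], r["gold"], k) for r in runs) / len(runs)
        for k in ks
    }


# Toy data: two questions with made-up session IDs (not LongMemEval data).
runs = [
    {"ranked": ["s7", "s2", "s9"], "gold": {"s7"}},  # gold session at rank 1
    {"ranked": ["s4", "s1", "s3"], "gold": {"s3"}},  # gold session at rank 3
]
scores = mean_recall(runs)
print(scores)
```

On this toy input only one of the two questions is a hit at k=1, while both are hits from k=3 onward.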

QA-accuracy harness (end-to-end metric)

The QA-accuracy harness runs the full pipeline: ingest conversation history → retrieve relevant memories at question time → pass retrieved context to an LLM answerer → have a separate LLM judge compare the answer against the gold label. The output is a single accuracy score over all 500 questions.

This is the right metric for production decisions. Recall@k can look excellent while the downstream answer quality is broken — for example, if retrieved snippets are relevant but too short to contain the specific fact the LLM needs. QA accuracy catches that. It is what the LongMemEval paper uses as its primary measure.
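The pipeline described above can be sketched as a loop over questions with three pluggable stages. This is an illustration only: `qa_accuracy`, the stub functions, and the toy memories are hypothetical names, and the real harness calls an LLM answerer and an LLM judge where the stubs stand in.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str, int], List[str]]
Answerer = Callable[[str, List[str]], str]
Judge = Callable[[str, str, str], bool]


def qa_accuracy(
    questions: List[Dict[str, str]],
    retrieve: Retriever,
    answer: Answerer,
    judge: Judge,
    k: int = 10,
) -> float:
    """Retrieve context, generate an answer, judge it, and average the verdicts."""
    correct = 0
    for item in questions:
        context = retrieve(item["question"], k)
        predicted = answer(item["question"], context)
        correct += judge(item["question"], predicted, item["gold"])
    return correct / len(questions)


# Stand-in stages so the sketch runs without any API key.
MEMORIES = ["The user prefers Python.", "The user lives in Lisbon."]


def toy_retrieve(question: str, k: int) -> List[str]:
    return MEMORIES[:k]  # returns every memory, so the evidence is always present


def toy_answer(question: str, context: List[str]) -> str:
    return context[0]  # a deliberately weak answerer: parrots the first snippet


def toy_judge(question: str, predicted: str, gold: str) -> bool:
    return gold.lower() in predicted.lower()


questions = [
    {"question": "Which language does the user prefer?", "gold": "Python"},
    {"question": "Where does the user live?", "gold": "Lisbon"},
]
accuracy = qa_accuracy(questions, toy_retrieve, toy_answer, toy_judge)
print(accuracy)
```

The second question fails even though its evidence was retrieved, because the answerer ignored it. That is the kind of failure recall@k cannot see.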

python -m bench run longmemeval --dataset s --mode qa \
    --embedder openai \
    --judge llm --judge-provider gemini --judge-model gemini-2.0-flash \
    --answerer-provider gemini --answerer-model gemini-2.5-flash \
    --decay-half-life 14 --decay-time-weight 0.4 --k 10

QA accuracy results

| Configuration | LongMemEval_S score | Cost / full run |
|---|---|---|
| Feather DB + GPT-4o | 0.693 | ~$7.50 |
| Feather DB + Gemini 2.5-flash | 0.657 | ~$2.40 |
| Full-context GPT-4o (paper ceiling) | 0.640 | ~$288 / 1K questions |
| Full-context GPT-4o-mini | 0.554 | (paper-reported) |
| Naive vector RAG (Stella + GPT-4o) | ~0.310 | (paper-reported) |

Feather DB + GPT-4o scores 0.693 against the paper's full-context GPT-4o ceiling of 0.640 — a +5.3pp improvement — at approximately 38× lower cost per thousand questions ($7.50 vs $288). The retrieval pipeline retrieves 10 snippets per question rather than stuffing 115K tokens of conversation history into the model's context window. Less noise, lower cost, higher accuracy.

Why BM25 beats naive embedding retrieval at recall

The BM25 baseline reaching recall@1=0.874 surprises engineers who assume dense vectors dominate retrieval tasks. The explanation is structural.

LongMemEval questions are heavily anchored on specific facts: a user's stated preference, a date, a proper noun, a version number. Dense embedding models average meaning across a sequence. Two memories that carry different specific facts — "User set up the project on March 3rd" and "User set up the project on April 17th" — embed to nearly identical vectors because the semantic content is almost identical. BM25 term frequency scoring distinguishes them because the date tokens are rare within the corpus and carry high IDF weight.
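A small worked example shows the effect. The sketch below is a textbook Okapi BM25 scorer with common default parameters (k1=1.5, b=0.75) and a Lucene-style non-negative IDF. These choices and the whitespace tokenizer are assumptions for illustration, not Feather DB's actual implementation.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms (low document frequency) get a high IDF weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "user set up the project on march 3rd",
    "user set up the project on april 17th",
    "user asked about the project budget",
]
scores = bm25_scores("what happened on march 3rd", docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, [round(s, 3) for s in scores])
```

The two near-duplicate memories differ only in their date tokens, and those tokens each occur in a single document, so the March memory ranks first by a clear margin.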

The pattern generalizes across four categories of query where BM25 outperforms naive dense retrieval:

| Query type | Example | Why dense underperforms |
|---|---|---|
| Temporal anchors | "What did I say on March 3rd?" | Date tokens collapse in embedding space |
| Proper nouns | "Alice's preference for Python" | Name tokens averaged into semantic neighbors |
| Version / ID strings | "v0.15.1 changelog" | Version strings have near-zero semantic distinction |
| Rare technical terms | "HNSW ef_construction parameter" | Rare tokens diluted by surrounding context |

In LongMemEval specifically, a large fraction of questions fall into the first two categories. This is why the BM25 baseline outperforms published naive RAG results by more than 2× at recall@1 (0.874 vs ~0.40 for standard dense retrieval on this dataset).

Feather DB's hybrid search uses Reciprocal Rank Fusion to combine BM25 and dense ANN results — taking the best of both. The BM25 index is stored inside the .feather file alongside the HNSW graph, updated incrementally on every add() call, with no separate indexing step.
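For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of the standard unweighted form, where each document scores the sum of 1 / (k + rank) over the input rankings. The constant k=60 is the value commonly used in the literature, and the memory IDs are made up. Feather DB's fusion may apply weights or a different constant.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists into one, best fused score first."""
    fused: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


bm25_ranking = ["m3", "m1", "m7"]   # hypothetical lexical results
dense_ranking = ["m1", "m9", "m3"]  # hypothetical dense ANN results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that appear in both lists (m1 and m3) rise above documents that only one retriever found, without needing the two score scales to be comparable.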

New dependency-free chat client

The QA-accuracy harness requires an LLM for both the answerer and the judge. Previously it used the openai and google-generativeai SDKs as hard dependencies, which created a mismatch: the retrieval harness ran with zero dependencies, but the QA harness needed a full install.

The new chat client uses only Python's standard library urllib — the same pattern as embedders.py — to talk to provider APIs directly. It supports four providers with a single consistent interface:

from bench.chat import chat_completion

# Works with any supported provider — no SDK installed
response = chat_completion(
    provider="gemini",   # or "openai", "anthropic", "ollama"
    model="gemini-2.5-flash",
    messages=[
        {"role": "user", "content": "What is the user's preferred programming language?"}
    ]
)

Provider routing is determined at call time by the provider argument. Environment variable requirements per provider:

| Provider | Env var | Notes |
|---|---|---|
| gemini | GOOGLE_API_KEY | Supports Gemini 2.5-flash, 2.0-flash, Pro |
| openai | OPENAI_API_KEY | Supports GPT-4o, GPT-4o-mini, o3 |
| anthropic | ANTHROPIC_API_KEY | Supports Claude Sonnet, Haiku, Opus |
| ollama | none | Local inference, no API key |

The Ollama path means you can run the full QA-accuracy harness, both the answerer and the LLM judge, with zero external API costs, provided you have a local model running. The retrieval harness with BM25 already runs with no API key at all. Together, these make the full benchmark suite runnable offline once the dataset has been downloaded.
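As a rough illustration of the standard-library approach, the sketch below builds a chat request with `urllib.request` for two of the providers. The routing table and the `build_chat_request` and `send` helpers are hypothetical and are not the contents of `bench.chat`. The URLs are the providers' public HTTP endpoints, and `llama3` is a placeholder model name.

```python
import json
import os
import urllib.request
from typing import Dict, List

# provider -> (endpoint URL, env var holding the API key or None)
ENDPOINTS = {
    "openai": ("https://api.openai.com/v1/chat/completions", "OPENAI_API_KEY"),
    "ollama": ("http://localhost:11434/api/chat", None),
}


def build_chat_request(
    provider: str, model: str, messages: List[Dict[str, str]]
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the chosen provider."""
    url, key_var = ENDPOINTS[provider]
    headers = {"Content-Type": "application/json"}
    if key_var is not None:
        headers["Authorization"] = f"Bearer {os.environ[key_var]}"
    body = {"model": model, "messages": messages, "stream": False}
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request needs no network access and, for Ollama, no API key.
req = build_chat_request(
    "ollama", "llama3", [{"role": "user", "content": "Hello"}]
)
print(req.get_method(), req.full_url)
```

Response shapes differ between providers, so a real client also needs a per-provider step that extracts the reply text from the decoded JSON.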

Choosing which harness to run

The two modes serve different purposes in a development workflow:

| Scenario | Use | Why |
|---|---|---|
| Tuning index parameters (k1, b, ef) | Retrieval harness | 11s per run, no API cost |
| Evaluating a new embedder | Retrieval harness | Isolates embedder quality from LLM answerer |
| Pre-release accuracy check | QA-accuracy harness | End-to-end signal; recall@k doesn't predict answer quality |
| Comparing answerer models | QA-accuracy harness | Same retrieval, different LLM — clean A/B |
| CI smoke test (no API key) | Retrieval harness + BM25 | Zero dependencies, deterministic, 11s |

The recommended workflow for a parameter change: run the retrieval harness to confirm recall doesn't regress, then run the QA harness on a 100-question subset before committing to the full 500-question run.
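The first step of that workflow can be automated with a small regression gate. The sketch below compares two sets of recall numbers and lists any metric that dropped by more than a tolerance. The function name, the dictionary shape, the tolerance, and the candidate numbers are all hypothetical. Only the baseline values come from the results reported in this post.

```python
from typing import Dict, List


def recall_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.005,
) -> List[str]:
    """Return the metrics where the candidate fell more than `tolerance` below baseline."""
    return [
        metric
        for metric, base in baseline.items()
        if candidate.get(metric, 0.0) < base - tolerance
    ]


# Baseline: the BM25 numbers reported above.
baseline = {"recall@1": 0.874, "recall@3": 0.942, "recall@5": 0.974, "recall@10": 0.986}
# Candidate: invented numbers standing in for a run after a parameter change.
candidate = {"recall@1": 0.860, "recall@3": 0.944, "recall@5": 0.974, "recall@10": 0.986}

regressed = recall_regressions(baseline, candidate)
print(regressed)
```

A CI job could fail when the returned list is non-empty and only then skip the more expensive QA run.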

Reproducing the BM25 baseline

pip install feather-db
git clone https://github.com/feather-store/feather && cd feather

# No API key needed — BM25 retrieval only
python -m bench run longmemeval --dataset s --mode retrieval --retriever bm25

# Results land in bench/results/ as JSON
# Summary table in bench/reports/latest.md

The dataset downloads automatically on first run. The full 500-question BM25 retrieval run completes in 11 seconds on a standard laptop. Per-question results are written as JSON — every recall number in this post is auditable by re-running the command.

What's measured by LongMemEval

LongMemEval (Wu et al., 2024; ICLR 2025) evaluates long-term memory in chat assistants across five axes. Each of the 500 questions is paired with a conversation history of approximately 115K tokens spread across roughly 40 sessions, most of which are distractors. The five axes test qualitatively different memory abilities:

  • information-extraction: recall a fact the user or assistant stated at some point in the history
  • multi-session reasoning: synthesize facts that were stated in separate, non-adjacent sessions
  • temporal reasoning: answer time-anchored questions ("what did I say three weeks ago?")
  • knowledge-updates: track changes or contradictions over time (the user changed their mind)
  • abstention: correctly refuse to answer when the required information isn't in the history

The retrieval harness measures whether the relevant session was returned in the top-k results. The QA harness measures whether the LLM produced a correct final answer given what was retrieved. A question whose evidence was retrieved correctly but whose answer the LLM got wrong still scores zero on QA accuracy, which is why recall@k and QA accuracy are both necessary and neither is sufficient alone.

Resources

Found a number that doesn't reproduce? Open an issue with your result JSON — we mean it about the audit trail.

Feather DB is part of Hawky.ai — AI-native development tools.