# LongMemEval Benchmark Explained: How We Measure AI Memory Quality

> LongMemEval is the most demanding public benchmark for AI memory systems: 500 questions spanning three months of conversation history. Here is what it tests, how the two harnesses work, and where Feather DB lands (0.693 QA accuracy, beating full-context GPT-4o at 38× lower cost).

- **Category**: Performance
- **Read time**: 7 min read
- **Date**: June 20, 2026
- **Author**: Ashwath (Founder, Feather DB)
- **URL**: https://getfeather.store/theory/longmemeval-benchmark-explained

---

| Metric | Score |
| --- | --- |
| recall@1 | 0.874 |
| recall@3 | 0.942 |
| recall@5 | 0.974 |
| recall@10 | 0.986 |
| Total runtime | 11 seconds |
| API key required | No |

A recall@1 of 0.874 means BM25 alone surfaces the right chunk as the top result in 87% of queries. A recall@10 of 0.986 means the chunk is somewhere in the first ten results for 98.6% of questions.

This is useful for two reasons. First, it gives you a fast, free baseline to verify your chunking and indexing pipeline before spending money on embeddings. Second, it shows that for many production use cases, keyword retrieval is already strong, so semantic search becomes the marginal improvement, not the foundation.

Run it yourself:

```bash
pip install feather-db
python -m feather_db.bench.longmemeval --harness retrieval --retriever bm25
```

---

## Feather DB end-to-end: QA accuracy

With the full pipeline (BM25+dense hybrid retrieval via RRF, adaptive decay scoring, and GPT-4o as the answerer), Feather DB scores **0.693** on LongMemEval QA accuracy. That means on 500 questions about facts from three months of conversation history, Feather DB answers **about 346 of them correctly**.

| System | Answerer | QA Accuracy | Cost / 1K sessions |
| --- | --- | --- | --- |
| **Feather DB** | GPT-4o | **0.693** | **$7.50** |
| Full-context GPT-4o (paper baseline) | GPT-4o | 0.640 | $288.00 |
| **Feather DB** | Gemini-2.5-Flash | **0.657** | **~$2.40** |
| Zep (graphiti) | GPT-4o | 0.712 | n/a |

The number that matters most here is the comparison to full-context GPT-4o.
The paper's baseline feeds the entire conversation history (every session, every turn) into GPT-4o's context window at query time. That is the naive approach, and it scores 0.640. Feather DB retrieves a small focused slice of that history and scores 0.693. **That is a +5.3 percentage point improvement, at 38× lower cost.**

---

## Why focused retrieval beats full context

The intuition is straightforward. When you pass 90 days of conversation into a context window, the model's attention is split across hundreds of facts, conversations, and topics. The relevant fact has to compete with everything else in that window.

When you retrieve a focused set of chunks (the three or four most relevant memory segments), the model's attention goes where it needs to go. Less noise, more signal.

Feather DB adds one more layer on top of retrieval: adaptive decay scoring. Facts that were retrieved often and recently are weighted higher than facts that have been dormant. The `half_life` parameter (default: 14 days for agent memory workloads) controls how quickly old facts fade from the top of scored results. This matters for the knowledge-update category: if a user corrected a preference last week, that update should dominate over the original preference from two months ago.

```python
import feather_db

# `db` is an already-open Feather DB handle and `query_vec` is the query embedding.
# half_life is in days (14 is the default for agent memory workloads).
cfg = feather_db.ScoringConfig(half_life=14.0, time_weight=0.4, min=0.0)
results = db.search(query_vec, k=5, scoring=cfg)
```

---

## What the score of 0.693 means in practice

A few grounding points for interpreting this number:

- **69.3% is not perfect.** Roughly 1 in 3 questions gets a wrong or incomplete answer. The benchmark is hard: temporal reasoning and knowledge updates are genuinely difficult for retrieval-based systems.
- **It beats the naive ceiling.** Full-context GPT-4o (the approach where you just throw everything at the model and hope) scores 0.640. A system that stores and retrieves selectively is already winning on the hardest evaluation available.
- **The cost gap is the real story.** $7.50 vs $288 per 1,000 sessions is not a marginal difference. At scale, that is the difference between a memory feature that is economically viable and one that is not.
- **Gemini-2.5-Flash gets you to 0.657 at about $2.40 per 1,000 sessions.** For latency-tolerant pipelines or cost-constrained applications, that is a strong option.

---

## What to watch when you run it yourself

The benchmark harness is included in the `feather-db` package. When you run it against your own memory configuration, these are the numbers worth tracking:

| Metric | What it tells you |
| --- | --- |
| recall@1 | How often the best chunk is ranked first; critical if you only pass k=1 to the LLM |
| recall@5 | Whether evidence is being found at a practical context budget |
| recall@10 | Upper bound on what your retriever can deliver when you pass at most ten chunks |
| QA accuracy | End-to-end correctness, the number that matters to users |
| Retrieval latency (p50, p99) | Whether the memory layer adds meaningful latency to the response path |
| Cost per 1K sessions | Whether the approach scales economically |

A gap between recall@10 and QA accuracy is usually an answerer problem: the evidence is being retrieved but the LLM is not using it correctly. A gap between recall@1 and recall@10 is a ranking problem: the right chunk is in the index but not being surfaced first. Both failures look identical in QA accuracy, but they have different fixes.

---

## Running the benchmark

The full harness ships with `feather-db`. Two commands cover both evaluation modes:

```bash
# Retrieval harness: BM25, no API key required
python -m feather_db.bench.longmemeval --harness retrieval --retriever bm25

# QA accuracy harness: requires OPENAI_API_KEY or GOOGLE_API_KEY
python -m feather_db.bench.longmemeval --harness qa --answerer gpt-4o
```

Results are written to `bench/results/longmemeval_{timestamp}.json` alongside the raw per-question scores. Every published number in this post was generated from those JSON files, so the audit trail is reproducible.
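For readers who want to sanity-check retrieval numbers by hand, here is a minimal sketch of how a hit-based recall@k and the recall@1 versus recall@k ranking gap can be computed. It is not the harness's own code, and it does not assume anything about the layout of the results JSON: the `ranked` and `gold` keys, the chunk ids, and the toy data are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Set


def hit_at_k(ranked_ids: List[str], gold_ids: Set[str], k: int) -> bool:
    """True if any gold (evidence) chunk id appears in the top-k ranked results."""
    return any(chunk_id in gold_ids for chunk_id in ranked_ids[:k])


def recall_at_k(questions: List[Dict], k: int) -> float:
    """Fraction of questions with at least one gold chunk in the top k."""
    if not questions:
        return 0.0
    hits = sum(hit_at_k(q["ranked"], q["gold"], k) for q in questions)
    return hits / len(questions)


# Hypothetical per-question records: retriever ranking plus gold evidence ids.
questions = [
    {"ranked": ["c3", "c1", "c7"], "gold": {"c1"}},  # found, but ranked second
    {"ranked": ["c2", "c9", "c4"], "gold": {"c2"}},  # found at rank one
    {"ranked": ["c5", "c6", "c8"], "gold": {"c0"}},  # not retrieved at all
    {"ranked": ["c4", "c2", "c1"], "gold": {"c1"}},  # found, but ranked third
]

r1 = recall_at_k(questions, 1)
r3 = recall_at_k(questions, 3)
ranking_gap = r3 - r1  # large gap: evidence is retrieved but not ranked first

print(f"recall@1={r1:.3f} recall@3={r3:.3f} ranking_gap={ranking_gap:.3f}")
```

A large `ranking_gap` points at reranking or scoring, while a low recall at the largest k you use points at chunking or indexing.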
If you are evaluating a custom memory configuration (different chunking strategy, different half-life, different k), the harness accepts flags for all of those. Run it against your own setup before committing to a production configuration.

---

## What we are working on next

The current weak spots are temporal reasoning (0.417–0.477) and the knowledge-update category, where scores plateau at 0.714 regardless of which model is used as the answerer. Both suggest the problem is in the retrieval and scoring layer, not the LLM. We are exploring explicit temporal indexing and conflict-aware update handling as next steps.

LongMemEval scores are reported in our [benchmark documentation](https://github.com/feather-store/feather/blob/master/docs/benchmarks/longmemeval.md) and updated with each major release.

---

*This is the machine-readable mirror of the theory post at [getfeather.store/theory/longmemeval-benchmark-explained](https://getfeather.store/theory/longmemeval-benchmark-explained). For the full Feather DB documentation, see [getfeather.store/llms-full.txt](https://getfeather.store/llms-full.txt).*