# LongMemEval Benchmark Explained: How We Measure AI Memory Quality

> LongMemEval is the most demanding public benchmark for AI memory systems — 500 questions spanning three months of conversation history. Here is what it tests, how the two harnesses work, and where Feather DB lands (0.693 QA accuracy, beating full-context GPT-4o at 38× lower cost).

- **Category**: Performance
- **Read time**: 7 min read
- **Date**: June 20, 2026
- **Author**: Ashwath (Founder, Feather DB)
- **URL**: https://getfeather.store/theory/longmemeval-benchmark-explained

---

| Metric | Score |
|---|---|
| recall@1 | 0.874 |
| recall@3 | 0.942 |
| recall@5 | 0.974 |
| recall@10 | 0.986 |
| Total runtime | 11 seconds |
| API key required | No |

A recall@1 of 0.874 means BM25 alone ranks the right chunk first for 87.4% of questions. A recall@10 of 0.986 means the right chunk appears somewhere in the top ten results for 98.6% of questions.

This is useful for two reasons. First, it gives you a fast, free baseline to verify your chunking and indexing pipeline before spending money on embeddings. Second, it shows that for many production use cases, keyword retrieval is already strong — semantic search becomes the marginal improvement, not the foundation.

Run it yourself:

```bash
pip install feather-db
python -m feather_db.bench.longmemeval --harness retrieval --retriever bm25
```
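To make the recall@k numbers above concrete, here is a minimal sketch of how such a metric can be computed. It assumes a question counts as a hit when at least one gold evidence chunk appears in the top k results; the harness may define a hit differently. The function name and the chunk IDs are hypothetical and are not part of `feather-db`.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[Sequence[str]],
                gold_ids: Sequence[Set[str]],
                k: int) -> float:
    """Fraction of questions whose top-k results contain a gold chunk.

    Assumption: one gold chunk in the top k is enough to count as a hit.
    """
    if not ranked_ids:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(ranked_ids, gold_ids)
        if any(chunk_id in gold for chunk_id in ranked[:k])
    )
    return hits / len(ranked_ids)


# Toy data: three questions with hypothetical chunk IDs.
ranked = [["c1", "c7", "c3"], ["c4", "c2", "c9"], ["c5", "c6", "c8"]]
gold = [{"c1"}, {"c2"}, {"c0"}]

print(recall_at_k(ranked, gold, k=1))  # 1 of 3 questions hit at k=1
print(recall_at_k(ranked, gold, k=3))  # 2 of 3 questions hit at k=3
```

Because recall@k can only stay flat or rise as k grows, a large jump between small and large k points at ranking rather than indexing.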

---

## Feather DB end-to-end: QA accuracy

With the full pipeline — BM25+dense hybrid retrieval via RRF, adaptive decay scoring, and GPT-4o as the answerer — Feather DB scores **0.693** on LongMemEval QA accuracy.

That means that, out of 500 questions about facts from three months of conversation history, Feather DB answers **roughly 346 correctly** (0.693 × 500 = 346.5).

  
    
| System | Answerer | QA Accuracy | Cost / 1K sessions |
|---|---|---|---|
| **Feather DB** | GPT-4o | **0.693** | **$7.50** |
| Full-context GPT-4o (paper baseline) | GPT-4o | 0.640 | $288.00 |
| **Feather DB** | Gemini-2.5-Flash | **0.657** | **~$2.40** |
| Zep (graphiti) | GPT-4o | 0.712 | n/a |

The number that matters most here is the comparison to full-context GPT-4o. The paper's baseline feeds the entire conversation history — every session, every turn — into GPT-4o's context window at query time. That is the naive approach, and it scores 0.640.

Feather DB retrieves a small focused slice of that history and scores 0.693. **That is a +5.3 percentage point improvement, at 38× lower cost.**
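The two headline figures follow directly from the accuracy and cost values reported for the GPT-4o rows. A quick check of the arithmetic, using only those published numbers (the variable names are illustrative):

```python
# Published figures for the two GPT-4o configurations.
feather_acc, full_ctx_acc = 0.693, 0.640
feather_cost, full_ctx_cost = 7.50, 288.00  # USD per 1,000 sessions

gain_pp = (feather_acc - full_ctx_acc) * 100   # accuracy gain in percentage points
cost_ratio = full_ctx_cost / feather_cost      # how many times cheaper

print(f"accuracy gain: {gain_pp:.1f} pp")
print(f"cost ratio:    {cost_ratio:.1f}x")
```

The cost ratio works out to 38.4, which the post rounds down to 38×.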

---

## Why focused retrieval beats full context

The intuition is straightforward. When you pass 90 days of conversation into a context window, the model's attention is split across hundreds of facts, conversations, and topics. The relevant fact has to compete with everything else in that window.

When you retrieve a focused set of chunks — the three or four most relevant memory segments — the model's attention goes where it needs to go. Less noise, more signal.
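The hybrid retrieval step mentioned earlier merges a BM25 ranking and a dense ranking with reciprocal rank fusion (RRF). The sketch below shows the standard RRF formula, where each document scores the sum of `1 / (k + rank)` over the lists it appears in. The constant `k=60` is the value commonly used in the RRF literature, not a confirmed Feather DB setting, and `rrf_fuse` is a hypothetical helper rather than part of the library.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Assumption: k=60 is the conventional default, not a Feather DB value.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy rankings with hypothetical chunk IDs.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)  # "b" and "a" lead because both retrievers returned them
```

RRF uses only ranks, so BM25 scores and cosine similarities never need to be put on a common scale.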

Feather DB adds one more layer on top of retrieval: adaptive decay scoring. Facts that have been retrieved often and recently are weighted higher than facts that have been dormant. The `half_life` parameter (default: 14 days for agent memory workloads) controls how quickly old facts fade from the top of scored results. This matters for the knowledge-update category: if a user corrected a preference last week, that update should outrank the original preference from two months ago.

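```python
import feather_db

# `db` is an already-open Feather DB handle and `query_vec` is the query
# embedding; both are assumed to exist before this snippet runs.
cfg = feather_db.ScoringConfig(half_life=14.0, time_weight=0.4, min=0.0)  # half_life in days
results = db.search(query_vec, k=5, scoring=cfg)
```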
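To build intuition for what a half-life does, here is a generic exponential-decay sketch. It is not Feather DB's actual scoring formula: the linear blend of similarity and recency, the `floor` argument, and the function name are all assumptions made for illustration, and it leaves out retrieval frequency entirely.

```python
def decayed_score(similarity: float,
                  age_days: float,
                  half_life: float = 14.0,
                  time_weight: float = 0.4,
                  floor: float = 0.0) -> float:
    """Blend similarity with a recency term that halves every `half_life` days.

    Assumption: a simple linear blend; the real library may combine differently.
    """
    recency = 0.5 ** (age_days / half_life)
    blended = (1.0 - time_weight) * similarity + time_weight * recency
    return max(blended, floor)


# A slightly less similar but recent correction vs. an older original fact.
original = decayed_score(similarity=0.80, age_days=60)
correction = decayed_score(similarity=0.75, age_days=7)

print(original, correction)  # the recent correction scores higher
```

Under these assumptions, a shorter half-life makes corrections overtake stale facts sooner, at the risk of burying old facts that are still true.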

---

## What the score of 0.693 means in practice

A few grounding points for interpreting this number:

  - **69.3% is not perfect.** Roughly 1 in 3 questions gets a wrong or incomplete answer. The benchmark is hard — temporal reasoning and knowledge updates are genuinely difficult for retrieval-based systems.

  - **It beats the naive ceiling.** Full-context GPT-4o — the approach where you just throw everything at the model and hope — scores 0.640. A system that stores and retrieves selectively is already winning on the hardest evaluation available.

  - **The cost gap is the real story.** $7.50 vs $288 per 1,000 sessions is not a marginal difference. At scale, that is the difference between a memory feature that is economically viable and one that is not.

  - **Gemini-Flash gets you to 0.657 at $2.40 per 1,000 sessions.** For latency-tolerant pipelines or cost-constrained applications, that is a strong option.

---

## What to watch when you run it yourself

The benchmark harness is included in the `feather-db` package. When you run it against your own memory configuration, these are the numbers worth tracking:

  
    
| Metric | What it tells you |
|---|---|
| recall@1 | How often the best chunk is ranked first; critical if you only pass k=1 to the LLM |
| recall@5 | Whether evidence is being found at a practical context budget |
| recall@10 | Upper bound on what your retriever can possibly deliver |
| QA accuracy | End-to-end correctness; the number that matters to users |
| Retrieval latency (p50, p99) | Whether the memory layer adds meaningful latency to the response path |
| Cost per 1K sessions | Whether the approach scales economically |

A gap between recall@10 and QA accuracy is usually an answerer problem — the evidence is being retrieved but the LLM is not using it correctly. A gap between recall@1 and recall@10 is a ranking problem — the right chunk is in the index but not being surfaced first. Both failures look identical in QA accuracy, but they have different fixes.
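The two gaps described above can be turned into a rough triage check. This is a sketch only: the 0.10 threshold is an arbitrary illustrative value, the function is hypothetical, and the example numbers are made up rather than taken from any benchmark run.

```python
from typing import List


def diagnose(recall_at_1: float,
             recall_at_10: float,
             qa_accuracy: float,
             gap: float = 0.10) -> List[str]:
    """Name the likely bottleneck(s) from the gaps between three metrics.

    Assumption: `gap` is an illustrative threshold, not a recommended value.
    """
    findings = []
    if recall_at_10 - recall_at_1 > gap:
        findings.append("ranking")   # evidence is retrieved but not ranked first
    if recall_at_10 - qa_accuracy > gap:
        findings.append("answerer")  # evidence is retrieved but not used correctly
    return findings or ["no large gap"]


# Hypothetical configurations.
print(diagnose(recall_at_1=0.60, recall_at_10=0.95, qa_accuracy=0.90))
print(diagnose(recall_at_1=0.90, recall_at_10=0.95, qa_accuracy=0.60))
```

The first configuration points at ranking and the second at the answerer, even though a single QA accuracy number would not distinguish the cause.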

---

## Running the benchmark

The full harness ships with `feather-db`. Two commands cover both evaluation modes:

```bash
# Retrieval harness — BM25, no API key required
python -m feather_db.bench.longmemeval --harness retrieval --retriever bm25

# QA accuracy harness — requires OPENAI_API_KEY or GOOGLE_API_KEY
python -m feather_db.bench.longmemeval --harness qa --answerer gpt-4o
```

Results are written to `bench/results/longmemeval_{timestamp}.json` alongside the raw per-question scores. Every published number in this post was generated from those JSON files — the audit trail is reproducible.

If you are evaluating a custom memory configuration — different chunking strategy, different half-life, different k — the harness accepts flags for all of those. Run it against your own setup before committing to a production configuration.

---

## What we are working on next

The current weak spots are temporal reasoning (0.417–0.477) and the knowledge-update category where scores plateau at 0.714 regardless of which model is used as the answerer. Both suggest the problem is in the retrieval and scoring layer, not the LLM. We are exploring explicit temporal indexing and conflict-aware update handling as next steps.

LongMemEval scores are reported in our [benchmark documentation](https://github.com/feather-store/feather/blob/master/docs/benchmarks/longmemeval.md) and updated with each major release.

---

*This is the machine-readable mirror of the theory post at [getfeather.store/theory/longmemeval-benchmark-explained](https://getfeather.store/theory/longmemeval-benchmark-explained). For the full Feather DB documentation, see [getfeather.store/llms-full.txt](https://getfeather.store/llms-full.txt).*