# You don't need GPT-4o full-context for AI memory — Feather DB beats it for $2.40

> Feather DB v0.8.0 + GPT-4o scores 0.693 on LongMemEval_S, beating the LongMemEval paper's full-context GPT-4o ceiling (0.640). The cheap tier with Gemini-Flash hits 0.657 for $2.40 per benchmark run.

- **Category**: Performance
- **Read time**: 6 min read
- **Date**: April 26, 2026
- **Author**: Hawky.ai (Feather DB Team)
- **URL**: https://getfeather.store/theory/longmemeval-results-april-2026

---

# You don't need GPT-4o full-context for AI memory — Feather DB beats it for $2.40

*Published: April 2026 · Hawky.ai · [github.com/feather-store/feather](https://github.com/feather-store/feather)*

---

## TL;DR

We just ran the [LongMemEval](https://arxiv.org/abs/2410.10813) benchmark — the standard 500-question evaluation for long-term memory in chat assistants — on **Feather DB v0.8.0**, our embedded vector database. Two runs, same retrieval pipeline, two different answerer models:

| Configuration | LongMemEval_S score | Cost / full run | Wall time |
|---|---|---|---|
| **Feather DB + GPT-4o** | **0.693** | ~$8 | 4.5 hours |
| **Feather DB + Gemini 2.5-flash** | **0.657** | ~$2.40 | 4.5 hours |
| Full-context GPT-4o (paper "ceiling") | 0.640 (paper-reported) | — | — |
| Full-context GPT-4o-mini | 0.554 (paper-reported) | — | — |
| Naive vector RAG (Stella + GPT-4o) | ~0.31 (paper-reported) | — | — |

Feather DB **beats the LongMemEval paper's full-context GPT-4o ceiling** — meaning a 10-snippet retrieval from a single `.feather` file delivers more useful signal to the answerer than dumping the entire 115K-token chat history into a frontier model.

The whole pipeline runs in-process. There's no service to host. The reproduce command is one shell line. The total benchmark cost — embeddings + answer generation + LLM judge for all 500 questions — is the price of an espresso.

Below: what this means, the per-axis breakdown, and how to run it yourself.

---

## What LongMemEval actually measures

LongMemEval (Wu et al., 2024 / ICLR 2025) is the standard end-to-end benchmark for memory in chat assistants: 500 questions, each paired with a long conversation history (~115K tokens across ~40 sessions, most of them distractors). Each question tests one of five memory abilities:

- **information-extraction** — recall a fact stated by the user or the assistant
- **multi-session reasoning** — synthesize facts across distant sessions
- **temporal reasoning** — answer time-anchored questions ("what did I say three weeks ago?")
- **knowledge-updates** — track changes / contradictions over time
- **abstention** — refuse to answer when the information isn't there

The eval is run end-to-end: ingest the haystack, retrieve relevant memories at question time, hand them to an LLM to answer, and have a separate LLM judge correctness against gold. **It only measures whether the assistant gave the right answer** — perfect retrieval that doesn't translate to a correct answer scores zero. This is the right test of "does memory work", because every other metric (recall@k, NDCG, etc.) can look great while the actual user experience is broken.
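To make that protocol concrete, here is a minimal sketch of the end-to-end loop in Python. It is an illustration only, not the benchmark harness's code; `retrieve`, `answer_llm`, and `judge_llm` are hypothetical stand-ins for whatever retriever, answerer model, and judge model you wire in.

```python
# Minimal sketch of the LongMemEval end-to-end protocol (illustration only,
# not the harness code). `retrieve`, `answer_llm`, and `judge_llm` are
# hypothetical callables standing in for the retriever, answerer, and judge.

def run_eval(questions, retrieve, answer_llm, judge_llm, k=10):
    correct = 0
    for q in questions:
        # 1. Retrieve the top-k memories for this question from the
        #    ingested haystack (the ~115K-token chat history).
        snippets = retrieve(q["question"], k=k)

        # 2. The answerer sees only the question plus those k snippets,
        #    never the full history.
        answer = answer_llm(question=q["question"], context=snippets)

        # 3. A separate LLM judges the final answer against gold (1 = correct).
        #    Perfect retrieval with a wrong final answer still scores zero.
        correct += judge_llm(question=q["question"], answer=answer,
                             gold=q["answer"])
    return correct / len(questions)
```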
---

## The headline result, per-axis

Same retrieval pipeline (Feather + Azure `text-embedding-3-small` + adaptive temporal decay), two answerer models:

| Axis | Feather + Gemini-Flash | Feather + GPT-4o |
|---|---|---|
| information-extraction | 0.896 | **0.942** |
| knowledge-updates | 0.714 | 0.714 |
| multi-session-reasoning | 0.583 | 0.606 |
| temporal-reasoning | 0.417 | 0.477 |
| **overall** | **0.657** | **0.693** |

By question_type:

| Type | Gemini-Flash | GPT-4o |
|---|---|---|
| single-session-user | 0.941 | **1.000** *(perfect)* |
| single-session-assistant | 0.964 | 0.964 |
| single-session-preference | 0.667 | 0.767 |
| knowledge-update | 0.714 | 0.714 |
| multi-session | 0.583 | 0.606 |
| temporal-reasoning | 0.417 | 0.477 |

Three things worth noticing:

- **Switching to GPT-4o lifts the overall score by +3.6pp**, but the lift is *concentrated*: temporal-reasoning +6.0pp, single-session-preference +10.0pp, single-session-user +5.9pp. Other axes are flat or marginal.
- **Knowledge-update is identical across model classes** (0.714). That's a fingerprint of a *retrieval-side* gap that no answerer can fix — a richer answer model can't make up for memories that weren't surfaced.
- **Single-session-user hits 100%** with GPT-4o. Combined with single-session-assistant at 96.4%, simple recall is essentially solved.

---

## The headline-headline: we beat full-context GPT-4o

The LongMemEval paper reports **full-context GPT-4o + Chain-of-Note + JSON output = 0.640** on the same dataset. Their setup: stuff the entire 115K-token chat history into GPT-4o's context window, ask the question, score the answer.

Feather + GPT-4o = **0.693** with a 10-snippet retrieval. **+5.3pp over the paper's full-context ceiling.**

What this means: our retrieval pipeline isn't just *acceptable* given a long-context-capable model — it's actually *better than handing the model everything*. The reasons are mechanical:

- **Less noise**: 10 carefully selected memories vs 115K tokens of distractors.
- **Lower input cost**: ~3K tokens to GPT-4o per question vs ~115K. **Roughly 40× cheaper per query.**
- **Lower latency**: smaller prompts mean faster responses, regardless of model.
- **The frontier-model ceiling stops mattering**: Feather + Flash beats full-context GPT-4o-mini (0.554 paper-reported). You can pick the cheapest model that the *retrieval* hands enough context to.

Retrieval instead of full-context stuffing isn't a new idea; it's the premise of RAG. What's new is that **the retrieval is now in-process, file-based, sub-millisecond, and free.**

---

## Reproducing this

The whole pipeline, one shell command:

```bash
pip install feather-db
git clone https://github.com/feather-store/feather && cd feather

# Set your credentials
export AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_DEPLOYMENT="text-embedding-3-small"
export AZURE_OPENAI_API_VERSION="2023-05-15"
export GOOGLE_API_KEY="..."   # for the Gemini judge and answerer

# Run on LongMemEval_S — 500 questions, ~$2.40, ~4.5 hours
python -m bench run longmemeval --dataset s --limit 0 \
  --embedder openai \
  --judge llm --judge-provider gemini --judge-model gemini-2.0-flash \
  --answerer-provider gemini --answerer-model gemini-2.5-flash \
  --decay-half-life 14 --decay-time-weight 0.4 --k 10
```

That's the entire benchmark. The dataset auto-downloads on first run, the result lands as a JSON file in `bench/results/`, and a rolled-up Markdown table sits in `bench/reports/latest.md`.

To run with GPT-4o instead, swap the answerer flags:

```bash
--answerer-provider azure --answerer-model gpt-4o-feather
```

(Plus the appropriate `AZURE_OPENAI_CHAT_*` env vars for the chat deployment.)
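A note on the two decay flags in that command: `--decay-half-life` and `--decay-time-weight` control how recency is blended into the retrieval score. Feather's adaptive decay also adjusts the half-life by recall count (see the feature list below), and its exact formula isn't reproduced here; the sketch that follows is one plausible reading of the flags, assuming an exponential half-life over memory age mixed with vector similarity by a fixed weight.

```python
from datetime import datetime, timezone

# Hedged sketch of what --decay-half-life 14 and --decay-time-weight 0.4
# could correspond to: exponential decay over memory age, blended with the
# vector-similarity score by a fixed weight. Illustration only; Feather's
# actual adaptive (recall-count-adjusted) formula may differ.

def decayed_score(similarity: float, memory_time: datetime, now: datetime,
                  half_life_days: float = 14.0,
                  time_weight: float = 0.4) -> float:
    age_days = (now - memory_time).total_seconds() / 86400.0
    recency = 0.5 ** (age_days / half_life_days)  # 1.0 now, 0.5 after one half-life
    return (1.0 - time_weight) * similarity + time_weight * recency

now = datetime(2026, 4, 26, tzinfo=timezone.utc)
week_old = decayed_score(0.82, datetime(2026, 4, 19, tzinfo=timezone.utc), now)
month_old = decayed_score(0.90, datetime(2026, 3, 26, tzinfo=timezone.utc), now)
# week_old ≈ 0.78 outranks month_old ≈ 0.63: recency can beat raw similarity.
```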
Per-question JSON results are checked into the repo at `bench/results/`. Every number in this post is auditable. If you re-run on different models and get different numbers, please open an issue with the JSON.

---

## What Feather DB actually is

[Feather DB](https://github.com/feather-store/feather) is an **embedded** vector database written in C++17 with Python and Rust bindings. Single binary `.feather` file. In-process with your code. No server. No infrastructure to stand up.

Designed specifically for AI long-term memory:

- **HNSW** for sub-ms ANN — p50 = 0.19 ms, recall@10 = 0.972 on 500K × 128-dim SIFT data
- **BM25 hybrid search** via Reciprocal Rank Fusion — handles paraphrase + exact terms
- **Adaptive Temporal Decay** — recall-count-adjusted half-life for "stickiness" of memories
- **Typed weighted edges + graph traversal** — context chains across sessions
- **Namespaces / entities / attribute filters** — multi-tenant friendly
- **Multimodal pockets** — text, visual, audio in one DB

MIT-licensed. `pip install feather-db`. v0.8.0 just shipped.

---

## What's coming next

Two threads pull on the result above:

**Knowledge-update is unchanged across model classes.** That tells us where the next investment should go. Feather already stores raw chat turns; competitor memory layers extract atomic facts at ingest time and resolve contradictions automatically. We're shipping **`feather_db.extractors`** in v0.9.0 to do the same — pluggable LLM-based fact extraction, ontology-aware edges, contradiction resolution. We expect this to specifically lift knowledge-updates and multi-session-reasoning.

**Same engine, different surface.** Feather DB is embedded today. We're building a managed cloud version for teams that want the same engine and the same file format, but a hosted API surface they can call over HTTPS. Coming Q3 2026 — your `.feather` file stays portable, your data stays yours, only the deployment topology changes. **The Cloud waitlist is open.** Drop your email below and we'll ping you when there's a beta to try.

→ **[Join the Cloud waitlist](https://www.getfeather.store/cloud)** ←

---

## Resources

- **GitHub**: [github.com/feather-store/feather](https://github.com/feather-store/feather)
- **PyPI**: `pip install feather-db`
- **Crates.io**: `feather-db-cli`
- **Detailed report** (config, all per-axis numbers, caveats): [`docs/benchmarks/longmemeval.md`](https://github.com/feather-store/feather/blob/master/docs/benchmarks/longmemeval.md)
- **arXiv paper** (now includes the §4.7 LongMemEval section): [`docs/featherdb_paper.pdf`](https://github.com/feather-store/feather/blob/master/docs/featherdb_paper.pdf)
- **Reproducible benchmark harness**: [`bench/`](https://github.com/feather-store/feather/tree/master/bench)
- **Per-run JSON results** (audit trail): [`bench/results/`](https://github.com/feather-store/feather/tree/master/bench/results)

---

*Found this useful? We'd love your star on [GitHub](https://github.com/feather-store/feather) and your feedback on whatever you build with it. Found something wrong? Open an issue with the JSON — we mean it about the audit trail.*

*Feather DB is part of [Hawky.ai](https://hawky.ai) — AI-native development tools.*

---

*This is the machine-readable mirror of the theory post at [getfeather.store/theory/longmemeval-results-april-2026](https://getfeather.store/theory/longmemeval-results-april-2026). For the full Feather DB documentation, see [getfeather.store/llms-full.txt](https://getfeather.store/llms-full.txt).*