Context Engine vs. Full Context Window: The $40 vs $1 Decision

The full-context approach and its cost

The simplest approach to AI memory is full-context: stuff the entire conversation history, all documents, and all prior interactions into the context window. Modern frontier models — GPT-4o (128K), Claude Sonnet (200K), Gemini 1.5 Pro (1M) — have context windows large enough to fit months of conversation history.

The problem is cost. At GPT-4o pricing ($0.0025 per 1K input tokens), a 115K-token context costs about $0.29 per query in input tokens alone. That's roughly $290 per 1,000 queries, or about $8,700 per day at 30,000 queries/day (around $260,000 over a 30-day month). For most production AI applications, this is prohibitive.
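The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that uses only the price and token count quoted in this section; the function and constant names are illustrative, and output tokens are ignored.

```python
def input_cost_usd(tokens_per_query: int, usd_per_1k_tokens: float, queries: int = 1) -> float:
    """Input-token cost in USD for `queries` requests of the same size."""
    return tokens_per_query / 1000 * usd_per_1k_tokens * queries


# Figures quoted in the text above (assumed to be current list prices).
GPT4O_USD_PER_1K_INPUT = 0.0025
FULL_CONTEXT_TOKENS = 115_000

per_query = input_cost_usd(FULL_CONTEXT_TOKENS, GPT4O_USD_PER_1K_INPUT)
per_1k_queries = input_cost_usd(FULL_CONTEXT_TOKENS, GPT4O_USD_PER_1K_INPUT, 1_000)
per_30k_queries = input_cost_usd(FULL_CONTEXT_TOKENS, GPT4O_USD_PER_1K_INPUT, 30_000)

print(f"per query:          ${per_query:.4f}")
print(f"per 1,000 queries:  ${per_1k_queries:,.2f}")
print(f"per 30,000 queries: ${per_30k_queries:,.2f}")
```

The unrounded per-query figure is $0.2875, which is why the per-1,000 cost lands slightly under $290.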

What LongMemEval shows

The LongMemEval benchmark (Wu et al., 2024) tests exactly this trade-off. Each of its 500 questions is paired with a ~115K-token conversation haystack. The paper reports 0.640 for the full-context GPT-4o baseline, which gives the model the entire haystack and asks it to find the answer.

Feather DB's retrieval pipeline, which selects 10 relevant memory chunks from the same haystack, scores 0.693 with GPT-4o. That is higher accuracy at roughly 40× lower input cost per query (about 3K input tokens instead of about 115K).

Approach	Score	Input tokens/query	Input cost/1K queries
Full-context GPT-4o	0.640	~115K	~$288
Feather DB + GPT-4o	0.693	~3K	~$7.50
Feather DB + Gemini-Flash	0.657	~3K	~$0.25

The Gemini-Flash line is the most striking: 0.657 accuracy (higher than full-context GPT-4o) for $0.25 per 1,000 queries. That's a 99.9% cost reduction from the full-context GPT-4o baseline.
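The ratios implied by the table can be checked directly. The sketch below hard-codes the scores and per-1K-query costs from the three rows above (the rounded ~$288, ~$7.50 and ~$0.25 figures) and derives the cost ratio, percentage reduction, and score difference relative to the full-context baseline. The `rows` dictionary and `compare` helper are illustrative names, not part of Feather DB.

```python
# (score, input cost in USD per 1,000 queries), copied from the table above.
rows = {
    "Full-context GPT-4o": (0.640, 288.00),
    "Feather DB + GPT-4o": (0.693, 7.50),
    "Feather DB + Gemini-Flash": (0.657, 0.25),
}

baseline_score, baseline_cost = rows["Full-context GPT-4o"]


def compare(name: str) -> dict:
    """Compare one approach against the full-context GPT-4o baseline."""
    score, cost = rows[name]
    return {
        "cost_ratio": baseline_cost / cost,                      # how many times cheaper
        "cost_reduction_pct": (1 - cost / baseline_cost) * 100,  # percent saved
        "score_delta": score - baseline_score,                   # accuracy difference
    }


for name in rows:
    result = compare(name)
    print(
        f"{name}: {result['cost_ratio']:.1f}x cheaper, "
        f"{result['cost_reduction_pct']:.2f}% reduction, "
        f"score delta {result['score_delta']:+.3f}"
    )
```

With these rounded inputs the GPT-4o retrieval row comes out at about 38× cheaper, which is the figure that gets rounded to 40× in the prose.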

Why retrieval beats full-context

This result seems counterintuitive — surely more context is better? The reasons retrieval wins are mechanical:

Signal-to-noise: A 115K-token history is mostly distractor content. The model has to read and attend over everything to find the answer. 10 retrieved chunks give the model the answer with minimal distraction.
Attention quality: Transformer attention is not uniform. Frontier models are known to underweight information in the middle of long contexts (the "lost in the middle" phenomenon). Short contexts with relevant content don't suffer from this.
Adaptive scoring: Feather's retrieval is not just cosine similarity — it incorporates recency, recall-based stickiness, and importance weights. The 10 chunks it selects are more likely to be relevant than the top-10 from pure embedding similarity.
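To make the adaptive-scoring idea concrete, here is a minimal sketch of how similarity, recency, stickiness, and importance could be blended into one ranking score. This is not Feather DB's actual formula or API: the `MemoryChunk` fields, the weights, the 30-day half-life, and the saturating stickiness term are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryChunk:
    # Hypothetical record shape; field names are illustrative only.
    text: str
    embedding: list[float]
    age_days: float      # time since the chunk was written or last touched
    recall_count: int    # how often it has been retrieved before
    importance: float    # caller-assigned weight in [0, 1]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def adaptive_score(query: list[float], chunk: MemoryChunk,
                   half_life_days: float = 30.0,
                   weights: tuple = (0.6, 0.2, 0.1, 0.1)) -> float:
    """Weighted blend of similarity, recency, stickiness, and importance."""
    w_sim, w_rec, w_stick, w_imp = weights
    recency = 0.5 ** (chunk.age_days / half_life_days)      # halves every half-life
    stickiness = 1.0 - 1.0 / (1.0 + chunk.recall_count)     # saturates toward 1
    return (w_sim * cosine(query, chunk.embedding)
            + w_rec * recency
            + w_stick * stickiness
            + w_imp * chunk.importance)


def top_k(query: list[float], chunks: list[MemoryChunk], k: int = 10) -> list[MemoryChunk]:
    """Return the k highest-scoring chunks for the query embedding."""
    return sorted(chunks, key=lambda c: adaptive_score(query, c), reverse=True)[:k]
```

With a pure cosine ranking, two chunks with identical embeddings tie; under a blend like this, the fresher or more frequently recalled one ranks first, which is the behaviour the bullet above describes.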

Where full-context still wins

Full-context GPT-4o is expected to keep an edge on two question types in the LongMemEval breakdown:

Temporal reasoning: "What did I say three weeks ago about X?" and similar questions that require precise time-anchoring. Feather scores 0.477 here; the full-context baseline is presumed, not measured here, to score higher on this axis.
Knowledge updates: Tracking contradictions ("I used to use Python 2, now I use Python 3"). Feather scores 0.714 regardless of model — a retrieval-side ceiling that extractors (v0.9.0) are designed to address.

For everything else — information extraction, multi-session reasoning, preference queries — retrieval is competitive or superior.

The practical decision

For production AI applications with 10k+ queries/day, the math is unambiguous. Full-context pricing doesn't scale. A context engine that retrieves 5–10 relevant chunks:

Reduces token costs by 30–40×
Reduces latency (smaller prompts → faster responses)
Delivers comparable or better accuracy on most question types
Keeps per-query cost roughly flat as each user's history grows, instead of paying for the whole history on every query

The question isn't "retrieval or full-context" — it's "which retrieval system." A flat vector store retrieves well for static corpora. A context engine retrieves better over time because high-value memories become stickier and low-value ones decay.

Getting started

pip install feather-db

The full benchmark reproduction command is in the bench/ directory of the Feather repo. Every number in this post is auditable from the per-question JSON results in bench/results/.