How to Reduce AI Hallucinations with Persistent Memory

Why LLMs hallucinate

A large language model doesn't know things. It has learned statistical patterns across trillions of tokens — patterns strong enough to produce confident, fluent, specific-sounding text. When the model encounters a question it has no reliable pattern for, it doesn't say "I don't know." It completes the pattern with the most plausible continuation it can construct.

That's hallucination: not randomness, not lying, but the model filling a knowledge gap with something that sounds like the right answer because it has the right shape.

The underlying cause is that the model's knowledge is frozen at training time. It has no mechanism to distinguish "I know this from training data" from "I'm generating something that fits the pattern." Both feel the same from the inside. Both arrive in the output at the same confidence level.

The two failure modes

Hallucinations in production systems fall into two distinct patterns:

1. No context: invents facts. The model has no grounding information for the query at all. It generates a plausible-sounding answer from parametric memory alone. The answer may be entirely fabricated, subtly wrong, or confidently out of date. This is the classic hallucination: the model fills a void with invention.
2. Stale context: uses outdated facts. The model has retrieved context, but the retrieved documents are old. A customer's account status changed last week. A product was discontinued last month. An API endpoint was deprecated. The retrieved context is real but no longer accurate, and the model has no way to know the difference. The output is grounded in facts that used to be true.

Most retrieval-augmented generation (RAG) systems address failure mode 1 by injecting documents before generation. Far fewer systems address failure mode 2 — which requires the retrieval layer itself to understand time, not just similarity.

How persistent memory helps

Persistent memory changes the generation equation: instead of asking the model to answer from parametric knowledge alone, the system retrieves verified, scored facts from an external store and injects them into the prompt before the model generates.

The model is no longer filling gaps from statistical patterns. It has specific facts, with specific sources, at specific timestamps. The generation task shifts from recall to synthesis: the model's job is to coherently explain what the memory layer found, not to invent what it thinks might be true.

This addresses both failure modes — but only if the memory layer knows which facts are current and which are stale. A flat vector store retrieves the most semantically similar chunks regardless of age. That can surface outdated context as confidently as fresh context. The retrieval layer needs to understand recency, not just relevance.

The grounding pattern

The standard grounding pattern with Feather DB:

import feather_db as fdb

db = fdb.DB.open("agent_memory.feather", dim=1536)

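# 1. Embed the query
#    embed() is a placeholder for your embedding function; it must return
#    a vector whose length matches dim=1536
query_vec = embed(user_query)

# 2. Retrieve top-k scored facts
#    half_life=14: a fact's recency score halves every 14 days of effective age
#    time_weight=0.4: recency contributes 40% of the blended score, similarity 60%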
results = db.search(
    query_vec,
    k=10,
    half_life=14,
    time_weight=0.4
)

# 3. Inject into system prompt
facts = "\n".join([
    f"- {r.text} [source: {r.meta.get_attribute('source_url')}]"
    for r in results
])

system_prompt = f"""Answer based only on the following verified facts.
If the facts do not contain the answer, say you don't know.

FACTS:
{facts}
"""

# 4. Generate: the model synthesizes from grounded context
#    (llm is a placeholder for your model client)
response = llm.complete(system_prompt, user_query)

The critical instruction is "Answer based only on the following verified facts." Without it, the model is free to supplement the retrieved context with parametric knowledge, which lets hallucinations back in through the instruction gap.
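One case the pattern above leaves open is a retrieval that returns nothing. The following is a minimal sketch of a prompt builder that makes that case explicit. It assumes retrieved facts arrive as (text, source) pairs; the helper name build_grounded_prompt, the fallback wording, and the sample fact are illustrative and not part of Feather DB.

```python
from typing import Sequence, Tuple


def build_grounded_prompt(facts: Sequence[Tuple[str, str]]) -> str:
    """Build a system prompt that restricts the model to the given facts.

    `facts` is a sequence of (text, source) pairs. This helper is a
    hypothetical illustration, not a Feather DB function.
    """
    header = (
        "Answer based only on the following verified facts.\n"
        "If the facts do not contain the answer, say you don't know.\n\n"
        "FACTS:\n"
    )
    if not facts:
        # Say so explicitly rather than leaving a blank FACTS section,
        # which could invite the model to fall back on training data.
        return header + "(no facts were retrieved for this query)\n"
    lines = [f"- {text} [source: {source}]" for text, source in facts]
    return header + "\n".join(lines) + "\n"


# Made-up example fact and URL, for illustration only
prompt = build_grounded_prompt([
    ("The v1 endpoint is deprecated.", "https://docs.example.com/changelog"),
])
print(prompt)
```

Keeping the instruction and the fact list in one function means the constraint cannot be dropped by accident when the prompt is edited elsewhere.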

The decay advantage

Static RAG retrieves by similarity score alone. Feather DB retrieves by a composite score that combines similarity, recency, recall-based stickiness, and a per-fact importance weight (log is the natural logarithm):

stickiness    = 1 + log(1 + recall_count)
effective_age = age_in_days / stickiness
recency       = 0.5 ^ (effective_age / half_life_days)
final_score   = ((1 - time_weight) × similarity
                 + time_weight × recency) × importance

The decay mechanism addresses the stale-context problem structurally. With a 14-day half-life, a fact stored six months ago and never recalled since has a recency score near zero, so its final score is limited to roughly (1 - time_weight) × similarity × importance and it drops below fresher facts of comparable similarity. A fact stored last week has a high recency score. A fact stored last month but recalled frequently has an effective age much lower than its calendar age, because stickiness slows how fast it ages.

The result: stale facts fade out of retrieval results passively. Fresh facts remain strong. The system tends toward current context without any manual curation.
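The formula can be checked with a few lines of Python. This is a sketch that transcribes the four equations above, assuming the natural logarithm; the function name final_score and the three sample facts (ages, recall counts, and the shared similarity of 0.80) are made up for illustration and are not Feather DB's implementation.

```python
import math


def final_score(similarity, age_in_days, recall_count,
                half_life_days=14.0, time_weight=0.4, importance=1.0):
    """Composite retrieval score, transcribed from the formula above."""
    stickiness = 1.0 + math.log(1.0 + recall_count)   # natural log assumed
    effective_age = age_in_days / stickiness
    recency = 0.5 ** (effective_age / half_life_days)
    blended = (1.0 - time_weight) * similarity + time_weight * recency
    return blended * importance


# Three illustrative facts with the same similarity to the query
six_months_cold = final_score(0.80, age_in_days=180, recall_count=0)
last_week = final_score(0.80, age_in_days=7, recall_count=0)
last_month_hot = final_score(0.80, age_in_days=30, recall_count=20)

ranked = sorted(
    [("six_months_cold", six_months_cold),
     ("last_week", last_week),
     ("last_month_hot", last_month_hot)],
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With equal similarity, the six-month-old fact keeps only the similarity share of its score, so both fresher facts outrank it. The frequently recalled month-old fact scores close to the week-old one because its effective age is divided by its stickiness.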

The LongMemEval evidence

LongMemEval (Wu et al., 2024 / ICLR 2025) is the standard end-to-end benchmark for long-term memory in chat assistants: 500 questions, each paired with a ~115K-token conversation haystack, testing information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.

Configuration	LongMemEval_S Score
Feather DB + GPT-4o	0.693
Feather DB + Gemini 2.5-Flash	0.657
Full-context GPT-4o (paper "ceiling")	0.640
Naive vector RAG (Stella + GPT-4o)	~0.310

The critical comparison is the first and third rows. Full-context GPT-4o — dumping the entire 115K-token history into the model's context window — scores 0.640. Feather DB retrieval with 10 selected memory chunks scores 0.693. Persistent memory with adaptive scoring beats full-context by 5.3 percentage points.

The naive vector RAG result (~0.310) is equally instructive. Retrieval without scoring, which treats every stored chunk as equally fresh and equally important, performs worse than full-context. The memory architecture matters as much as the fact that retrieval happened at all.

Recall strengthening: important truths stay on top

The recall_count field in Feather DB's metadata is auto-incremented every time a node appears in a search result. This creates a feedback loop that works against hallucination: facts that are retrieved often, because they're relevant to frequent queries, become stickier, which makes them more likely to surface in future queries of the same type.

recall_count	stickiness	effective age rate
0	1.00	normal
5	2.79	36% of normal
20	4.04	25% of normal
50	4.93	20% of normal

A fact about a user's core preference, something that surfaces in almost every query, accumulates recall weight and ages several times more slowly than an unrecalled fact, so the model keeps seeing it. A one-time detail that never gets retrieved again loses half its recency score every half-life, so it stops crowding future answers with outdated context.

No manual tagging required. The retrieval pattern becomes the memory management signal.
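Because effective_age = age_in_days / stickiness, a fact's recency halves every half_life × stickiness calendar days. The sketch below recomputes stickiness and that stretched half-life directly from the formula, assuming the natural logarithm and the 14-day half-life used in the search examples; the function name is illustrative.

```python
import math

HALF_LIFE_DAYS = 14  # same half-life as the search examples


def stickiness(recall_count: int) -> float:
    """stickiness = 1 + log(1 + recall_count), natural log assumed."""
    return 1.0 + math.log(1.0 + recall_count)


for count in (0, 5, 20, 50):
    s = stickiness(count)
    aging_rate = 1.0 / s                     # fraction of the normal aging speed
    calendar_half_life = HALF_LIFE_DAYS * s  # days until recency halves
    print(f"{count:>2} recalls: stickiness={s:.2f}, "
          f"ages at {aging_rate:.0%} of normal, "
          f"recency halves every {calendar_half_life:.0f} days")
```

The growth is logarithmic, so each additional recall buys less extra lifetime than the one before it. A heavily recalled fact ages slowly, but it still ages.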

Graph edges for fact provenance

When a memory store grows large, hallucinations can emerge from a subtler failure: the model correctly retrieves a fact but can't verify where it came from or whether a related fact supersedes it. The follows_from edge type in Feather DB's graph layer addresses this.

# Link an updated fact to the fact it supersedes
db.link(
    from_id=updated_fact_id,
    to_id=original_fact_id,
    rel_type="follows_from",
    weight=1.0
)

# At retrieval time, traverse the chain
chain = db.context_chain(query_vec, k=5, hops=2)

context_chain runs vector search to find the top-k seed nodes (hop=0), then performs BFS expansion across typed edges. A query that surfaces an updated fact automatically traverses the follows_from edge to retrieve the original, giving the model the full chain: what the system believed, and what superseded it. The model can then reason about the update rather than silently using the original as though it were still current.

This is particularly important for knowledge-update queries — questions that ask about facts that changed over time. The LongMemEval knowledge-updates axis scores identically (0.714) across model classes in Feather's benchmark results, which is a signature of a retrieval-side gap: no answerer model can compensate for memories that weren't surfaced with their update chain intact.
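The hop-limited expansion step can be pictured with a small breadth-first search. This is a sketch of the general technique over a hypothetical in-memory edge list, not Feather DB's implementation of context_chain; the function expand_chain, the edge dictionary, and the node IDs are invented for illustration, and the vector-search step that picks the seeds is omitted.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical edge list: node_id -> [(neighbor_id, rel_type), ...]
Edges = Dict[str, List[Tuple[str, str]]]


def expand_chain(seeds: List[str], edges: Edges, hops: int) -> Dict[str, int]:
    """Breadth-first expansion from seed nodes, at most `hops` edges away.

    Returns node_id -> hop distance, with seeds at hop 0.
    """
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == hops:
            continue  # hop limit reached, do not expand further
        for neighbor, _rel_type in edges.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Each newer price fact points at the one it supersedes
edges = {
    "price_v3": [("price_v2", "follows_from")],
    "price_v2": [("price_v1", "follows_from")],
    "price_v1": [("price_v0", "follows_from")],
}
chain = expand_chain(["price_v3"], edges, hops=2)
print(chain)
```

With hops=2 the seed and its two predecessors are returned, while price_v0 lies three edges away and is left out. The hop limit is what keeps a long update history from flooding the prompt.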

What doesn't help: retrieval without scoring

The naive-RAG result from LongMemEval (~0.310) illustrates the failure mode precisely. A static vector store retrieves by cosine similarity alone. It has no concept of when a fact was stored, how often it's been useful, or whether it's been superseded. The most semantically similar chunks might be the oldest, most outdated entries in the store, embedded near the query because they discuss the same topic, not because they contain current information.

Retrieval without scoring doesn't reduce hallucinations; it changes their character. Instead of inventing facts from nothing, the model answers with false confidence from retrieved context that looks authoritative but is out of date. This can be harder to catch than pure fabrication because the form of the answer is correct even when the content is wrong.

The fix is not better embedding models. It's a scoring layer that understands time.
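A toy comparison shows how the ranking flips. In this sketch two chunks discuss the same topic, so their embeddings sit close together; the 2-d vectors, texts, and ages are invented for illustration, and the recency term covers only the never-recalled case (stickiness of 1).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def recency(age_in_days, half_life_days=14.0):
    """Recency term for a chunk that has never been recalled."""
    return 0.5 ** (age_in_days / half_life_days)


query = [1.0, 0.0]
chunks = [
    {"text": "Plan: Basic (old record)", "vec": [0.98, 0.20], "age_days": 200},
    {"text": "Plan: Pro (current record)", "vec": [0.95, 0.31], "age_days": 3},
]

# Ranking 1: cosine similarity alone
by_similarity = sorted(
    chunks, key=lambda c: cosine(query, c["vec"]), reverse=True
)

# Ranking 2: similarity blended with recency
time_weight = 0.4
by_blend = sorted(
    chunks,
    key=lambda c: (1 - time_weight) * cosine(query, c["vec"])
    + time_weight * recency(c["age_days"]),
    reverse=True,
)

print("similarity only:", by_similarity[0]["text"])
print("time-weighted:  ", by_blend[0]["text"])
```

The old record wins on cosine similarity by a small margin and loses once recency is blended in. A better embedding model would not change the first ranking, because both chunks really are about the same topic.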

Production pattern: source attribution

In production, every fact stored in persistent memory should carry its origin as metadata. When the model generates from retrieved context, the source should be returned alongside the answer — not just so users can verify, but so the system can audit which sources are contributing to which outputs.

# At ingest time — store source with every fact
meta = fdb.Metadata(importance=0.85)
meta.set_attribute("source_url", "https://docs.example.com/api/v2/endpoints")
meta.set_attribute("source_doc", "API Reference v2.4")
meta.set_attribute("ingested_at", "2026-06-16")
db.add(id=fact_id, vec=embed(fact_text), meta=meta)

# At retrieval time — return source with result
results = db.search(query_vec, k=10, half_life=14, time_weight=0.4)
for r in results:
    print(r.text)
    print("  source:", r.meta.get_attribute("source_url"))
    print("  doc:   ", r.meta.get_attribute("source_doc"))

When the model generates an incorrect answer, source attribution tells you which retrieved fact caused it. If a retrieved chunk with a stale source URL caused a hallucination, you can remove that node, set its importance weight to 0, or flag the source document for re-ingestion. Without attribution, debugging hallucination sources becomes archaeology: searching through the vector store without knowing where to look.

Source attribution also makes the grounding instruction credible to the model. "Answer based only on these facts — sources are listed" is a stronger constraint than "answer based only on these facts" because the model can see that each fact has a verifiable origin. The instruction becomes checkable rather than abstract.
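The audit side of attribution can be sketched without any database at all. The example assumes each generated answer is stored together with the source URLs of the facts injected into its prompt; the GroundedAnswer class, the sources_behind helper, and the URLs are hypothetical and not part of Feather DB.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class GroundedAnswer:
    """An answer bundled with the sources of the facts it was generated from."""
    text: str
    sources: List[str] = field(default_factory=list)


def sources_behind(flagged: List[GroundedAnswer]) -> Counter:
    """Count how often each source sits behind an answer flagged as wrong."""
    counts: Counter = Counter()
    for answer in flagged:
        counts.update(set(answer.sources))  # count a source once per answer
    return counts


# Two answers that reviewers flagged as incorrect (made-up data)
flagged = [
    GroundedAnswer("wrong answer A", ["https://docs.example.com/api/v1",
                                      "https://docs.example.com/faq"]),
    GroundedAnswer("wrong answer B", ["https://docs.example.com/api/v1"]),
]
suspects = sources_behind(flagged).most_common()
print(suspects)
```

A source that keeps appearing behind flagged answers is the first candidate for re-ingestion or removal.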

Summary: the hallucination-reduction stack

Store facts with timestamps and source metadata. Every ingested node should know where it came from and when.
Retrieve with decay scoring. Cosine similarity alone surfaces the semantically closest chunks, not the most current ones. Time-weighted scoring surfaces fresh context.
Link updates with follows_from edges. Use context_chain to surface the update history alongside the fact, so the model reasons about changes rather than seeing snapshots.
Ground the prompt explicitly. Instruct the model to answer only from retrieved context. Without this instruction, retrieval is a suggestion rather than a constraint.
Return source attribution. Every answer should carry the sources that produced it — for users, for auditing, and for debugging when things go wrong.

The LongMemEval result (0.693 vs 0.640 full-context) is the end-to-end evidence that this stack works: not just an architectural preference, but a measurable accuracy improvement over the alternative of giving the model everything and hoping it finds the right answer inside 115K tokens of noise.

Install: pip install feather-db · GitHub: github.com/feather-store/feather