Feather DB + OpenAI Agents SDK: Persistent Memory for GPT Agents

The statefulness gap in the OpenAI Agents SDK

The OpenAI Agents SDK (openai-agents package) ships with clean primitives: function-calling tools via @function_tool, agent handoffs via handoff(), and a Runner that handles the tool-call loop. What it doesn't provide is any memory layer. Each Runner.run() call starts cold. The agent has no knowledge of previous conversations, previously established user preferences, or facts it learned three sessions ago.

For a simple Q&A bot, statelessness is fine. For any agent that's supposed to know you — a personal assistant, a support agent, a coding copilot — it becomes the core failure mode. Users repeat themselves. The agent gives the same generic answer it gave last week. Trust erodes.

Feather DB plugs this gap with two tools: search_memory (retrieve relevant context before responding) and add_memory (store facts after each turn). The agent calls them, and memory persists across runs in a single .feather file. Cold-start load in v0.16.0 is 48ms for 50k vectors (with FEATHER_LOAD_THREADS=8), so memory is ready before your first API call completes.

Install

pip install feather-db openai openai-agents

Step 1: Initialize Feather DB and the embed function

import os
import feather_db as fdb
from openai import OpenAI

# v0.16.0: parallel HNSW load — 48ms cold start on 50k vectors
os.environ["FEATHER_LOAD_THREADS"] = "8"

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# One .feather file per deployment — namespaces isolate each user
db = fdb.DB.open("agent_memory.feather", dim=1536)  # text-embedding-3-small

def embed(text: str) -> list[float]:
    resp = openai_client.embeddings.create(
        input=[text],
        model="text-embedding-3-small"
    )
    return resp.data[0].embedding

One file, all users. Namespace isolation (covered below) keeps their memories separate — no cross-contamination.
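Because search_memory and add_memory (Step 2) will often embed the same user message within a single turn, a small cache in front of embed() can avoid a duplicate API call. The sketch below is illustrative only: fake_embed is a hypothetical stand-in for the real embed() so the example runs offline, and cached_embed is not part of Feather DB or the OpenAI client.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the stand-in embedding backend is hit

def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for embed(); returns a tiny deterministic vector."""
    CALLS["count"] += 1
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

@lru_cache(maxsize=1024)
def _cached(text: str) -> tuple[float, ...]:
    # Store the vector as a tuple so the cached copy is immutable
    return tuple(fake_embed(text))

def cached_embed(text: str) -> list[float]:
    """Return a fresh list each call so callers cannot mutate the cached copy."""
    return list(_cached(text))

v1 = cached_embed("I prefer bullet points")
v2 = cached_embed("I prefer bullet points")  # served from the cache
```

In a real deployment the body of _cached would call embed() instead of fake_embed(); the cache size of 1024 is an arbitrary choice.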

Step 2: Implement search_memory and add_memory

These are plain Python functions first. The Agents SDK wrapper comes in Step 3.

from datetime import datetime

def search_memory(
    query: str,
    user_id: str,
    k: int = 6,
    half_life: int = 30
) -> str:
    """
    Retrieve the k most relevant memories for this user.
    half_life controls decay speed in days — lower = faster fade.
    """
    vec = embed(query)
    results = db.context_chain(
        vec,
        k=k,
        namespace=user_id,   # each user_id is its own namespace
        max_depth=2,
        half_life=half_life
    )
    if not results:
        return "No relevant memories found."

    lines = [f"Retrieved {len(results)} memories:"]
    for i, mem in enumerate(results, 1):
        mem_type = mem.meta.get_attribute("type") or "message"
        lines.append(
            f"{i}. [{mem_type}] (score={mem.score:.3f}) {mem.text}"
        )
    return "\n".join(lines)


def add_memory(
    text: str,
    user_id: str,
    memory_type: str = "message",
    half_life: int = 30,
    importance: float = 1.0
) -> str:
    """
    Store a fact, preference, or message for this user.
    memory_type tags the entry; half_life and importance tune recall weight.
    """
    vec = embed(text)
    mem = db.add(
        vec,
        text=text,
        namespace=user_id,
        entity="conversation"
    )
    mem.meta.set_attribute("type", memory_type)
    mem.meta.set_attribute("importance", importance)
    mem.meta.set_attribute("half_life", half_life)
    mem.meta.set_attribute("created_at", datetime.utcnow().isoformat())
    return f"Stored (id={mem.id}): {text[:80]}"

Step 3: Wrap as Agents SDK tools and build the agent

from agents import Agent, Runner, function_tool

# Bind user_id through a closure at agent-construction time, so the model
# never has to supply (or guess) it as a tool argument.
def make_memory_tools(user_id: str):

    @function_tool
    def search_memory_tool(query: str) -> str:
        """
        Search this user's memory for context relevant to the query.
        Call this at the start of every response before answering.
        """
        return search_memory(query, user_id=user_id)

    @function_tool
    def add_memory_tool(
        text: str,
        memory_type: str = "message",
        half_life: int = 30,
        importance: float = 1.0
    ) -> str:
        """
        Save information to this user's memory.
        memory_type: 'preference' | 'fact' | 'message' | 'decision'
        half_life: days until memory fades — use 180 for preferences, 7 for session facts
        importance: 0.5 (low) to 3.0 (critical); default 1.0
        """
        return add_memory(
            text,
            user_id=user_id,
            memory_type=memory_type,
            half_life=half_life,
            importance=importance
        )

    return search_memory_tool, add_memory_tool


def build_agent(user_id: str) -> Agent:
    search_tool, add_tool = make_memory_tools(user_id)

    return Agent(
        name="Assistant",
        instructions="""You are a helpful assistant with persistent memory.

On every turn:
1. Call search_memory_tool with the user's message to retrieve relevant context.
2. Use that context to personalize your response — reference what you know.
3. After responding, call add_memory_tool to save:
   - The user's message (memory_type='message', half_life=30)
   - Any preference the user revealed (memory_type='preference', half_life=180, importance=2.0)
   - Any important fact or decision (memory_type='fact', half_life=90, importance=1.5)

Be explicit when you recall something: "Based on what you told me earlier..."
Never pretend to know something you didn't retrieve from memory.""",
        tools=[search_tool, add_tool],
        model="gpt-4o"
    )

Step 4: Automatic memory on every turn

The pattern below runs a full conversation loop. Memory search happens before each response; memory write happens after. The agent handles both tool calls in its internal loop — you just pass the user message and get the response.

import asyncio

async def chat(user_id: str, message: str) -> str:
    agent = build_agent(user_id)
    result = await Runner.run(
        agent,
        input=message,
        max_turns=6   # search + respond + write = 3 turns minimum
    )
    return result.final_output


async def demo():
    user = "user_42"

    # Turn 1: user reveals a preference
    r1 = await chat(user, "I prefer concise bullet-point answers, not long paragraphs.")
    print(f"Turn 1: {r1}\n")

    # Turn 2: different topic — agent should still surface the preference
    r2 = await chat(user, "Explain how HNSW indexing works.")
    print(f"Turn 2: {r2}\n")

    # Turn 3: explicit recall test
    r3 = await chat(user, "What format do I prefer for answers?")
    print(f"Turn 3: {r3}\n")

if __name__ == "__main__":
    asyncio.run(demo())

If the model follows its instructions, Turn 1 stores the preference with half_life=180 and importance=2.0. Turn 2's search_memory_tool call then retrieves it before the HNSW explanation, so the agent answers in bullets without being reminded. That's the payoff.

Step 5: Namespace per user — isolation by design

Every add_memory and search_memory call passes namespace=user_id. Feather DB enforces strict namespace isolation at the index level — a search in namespace="user_42" never touches vectors stored under namespace="user_99". No query-time filtering, no risk of leakage.

# Inspect what's stored for a specific user
user_vec = embed("user preferences")
results = db.search(user_vec, k=20, namespace="user_42")
print(f"user_42 has {len(results)} memories")

# Count across all namespaces
print(f"Total vectors in file: {db.count()}")
print(f"user_42 vectors: {db.count(namespace='user_42')}")

One .feather file serves every user in your system. Each user gets their own isolated memory space. No separate databases, no per-user deployments.
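To make the isolation idea concrete, here is a toy model. It is not Feather DB's actual implementation: each namespace simply owns its own list, so a lookup in one namespace has no code path to another's data. ToyStore and its methods are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Optional

class ToyStore:
    """Toy model of per-namespace partitions (not Feather DB internals)."""

    def __init__(self) -> None:
        # One independent partition per namespace, created on first write
        self._partitions: dict[str, list[str]] = defaultdict(list)

    def add(self, text: str, namespace: str) -> None:
        self._partitions[namespace].append(text)

    def search(self, keyword: str, namespace: str) -> list[str]:
        # Only the caller's partition is scanned; other partitions are never read
        partition = self._partitions.get(namespace, [])
        return [t for t in partition if keyword.lower() in t.lower()]

    def count(self, namespace: Optional[str] = None) -> int:
        if namespace is None:
            return sum(len(p) for p in self._partitions.values())
        return len(self._partitions.get(namespace, []))

store = ToyStore()
store.add("Prefers bullet-point answers", namespace="user_42")
store.add("Prefers long-form answers", namespace="user_99")

hits = store.search("prefers", namespace="user_42")
```

Both entries match the keyword, but the search for user_42 returns only that user's entry because the other partition is never consulted.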

Step 6: Adaptive decay — preferences outlast session facts

Not all memories should fade at the same rate. A user's preferred response format should still surface six months from now. A fact from today's troubleshooting session is irrelevant by next week.

Feather DB's adaptive decay is controlled per-memory via half_life (days) and importance (weight multiplier). The agent's instructions encode these directly:

# Preference: long-lived, high importance
add_memory(
    "User prefers bullet-point answers over long paragraphs.",
    user_id="user_42",
    memory_type="preference",
    half_life=180,    # fades over ~6 months
    importance=2.0    # surfaces even when semantic match is weak
)

# Session fact: short-lived, normal importance
add_memory(
    "User is debugging a KeyError on line 47 of ingest.py.",
    user_id="user_42",
    memory_type="fact",
    half_life=7,      # fades after ~a week
    importance=1.0
)

# Conversational message: medium decay
add_memory(
    "User asked how HNSW handles deletions.",
    user_id="user_42",
    memory_type="message",
    half_life=30,
    importance=1.0
)

The agent instructions tell GPT-4o to set these values. In practice, the model usually picks sensible values when the signal is clear (an explicit preference versus a one-off session fact), so you don't need a separate classifier.
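The exact scoring formula is not spelled out here, so the following is only a sketch under a common assumption: exponential half-life decay, where a memory's weight is importance × 0.5^(age_days / half_life). decay_weight is a hypothetical helper, not a Feather DB function. It illustrates why a 180-day preference would still outrank a 7-day session fact a month after both were stored.

```python
def decay_weight(age_days: float, half_life: float, importance: float = 1.0) -> float:
    """Assumed model: weight halves every `half_life` days, scaled by importance."""
    return importance * 0.5 ** (age_days / half_life)

age = 30  # days since each memory was stored

preference = decay_weight(age, half_life=180, importance=2.0)
message = decay_weight(age, half_life=30, importance=1.0)
session_fact = decay_weight(age, half_life=7, importance=1.0)

print(f"preference:   {preference:.3f}")
print(f"message:      {message:.3f}")
print(f"session fact: {session_fact:.3f}")
```

Under this assumed model, the preference keeps a weight of about 1.78 after 30 days, the message sits at exactly 0.5 (one half-life), and the session fact has dropped to roughly 0.05.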

Step 7: add_batch() for history import

If a user already has chat history (from another system, a CSV export, or a previous session log), use add_batch() to load it in one parallel call instead of a sequential loop. On a 4-core machine, add_batch() is 3.4× faster than sequential add() for bulk ingest.

import numpy as np

def import_chat_history(user_id: str, messages: list[dict]):
    """
    Bulk-load existing chat history into Feather DB.
    messages: list of {"role": "user"|"assistant", "content": str}
    """
    if not messages:
        return

    texts = [m["content"] for m in messages]
    roles = [m["role"] for m in messages]

    # Embed all messages in one batch API call
    response = openai_client.embeddings.create(
        input=texts,
        model="text-embedding-3-small"
    )
    vecs = np.array(
        [r.embedding for r in response.data],
        dtype=np.float32
    )

    # Build metadata — assign half_life by role
    metas = []
    for role in roles:
        m = fdb.Metadata(importance=1.0)
        m.set_attribute("type", "message")
        m.set_attribute("role", role)
        m.set_attribute(
            "half_life",
            30 if role == "user" else 14
        )
        m.set_attribute("source", "history_import")
        m.set_attribute("created_at", datetime.utcnow().isoformat())
        metas.append(m)

    # Parallel ingest — GIL released during HNSW graph construction
    start = db.count(namespace=user_id)
    ids = list(range(start, start + len(texts)))
    db.add_batch(ids, vecs, metas=metas, namespace=user_id)
    db.save()

    print(f"Imported {len(texts)} messages for {user_id}")


# Usage: load 500 historical messages before the first live turn
history = [
    {"role": "user", "content": "I always want code examples in Python 3.12."},
    {"role": "assistant", "content": "Noted — I'll use Python 3.12 syntax."},
    # ... 498 more
]
import_chat_history("user_42", history)

After import_chat_history() completes, the agent's search_memory_tool will surface relevant historical context on the very first live turn. No warm-up period needed.
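For large histories, a single embeddings request may exceed the API's per-request input limit, so it can help to split the texts into chunks first. The sketch below assumes a limit of 2048 inputs per request (check the current OpenAI documentation for the real value); chunked and MAX_INPUTS_PER_REQUEST are hypothetical names, not part of any library.

```python
from typing import Iterator

MAX_INPUTS_PER_REQUEST = 2048  # assumed limit; verify against the API docs

def chunked(items: list[str], size: int = MAX_INPUTS_PER_REQUEST) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

texts = [f"message {i}" for i in range(5000)]
batches = list(chunked(texts))
sizes = [len(b) for b in batches]
```

Inside import_chat_history(), each batch would get its own embeddings.create call, with the resulting vectors concatenated in order before the add_batch() call.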

Step 8: Inject retrieved context into the system prompt

The function-calling approach above works well and lets GPT-4o decide when to search. For tighter latency control, you can also pre-retrieve context server-side and inject it directly into the system prompt before calling the agent — bypassing one tool-call round-trip.

async def chat_with_preloaded_context(
    user_id: str,
    message: str
) -> str:
    # Retrieve before the agent run: the Feather DB lookup adds ~2ms
    # (plus one embedding call) and saves one tool-call round-trip
    context = search_memory(message, user_id=user_id, k=6)

    agent = Agent(
        name="Assistant",
        instructions=f"""You are a helpful assistant with persistent memory.

Relevant context retrieved from this user's memory:
{context}

Use this context to personalize your response.
After responding, call add_memory_tool to save any new preferences or facts.""",
        tools=[make_memory_tools(user_id)[1]],  # add_memory only — search already done
        model="gpt-4o"
    )

    result = await Runner.run(agent, input=message, max_turns=4)

    # Also store the turn explicitly
    add_memory(message, user_id=user_id, memory_type="message", half_life=30)
    add_memory(result.final_output, user_id=user_id,
               memory_type="message", half_life=14, importance=0.8)

    return result.final_output

Both patterns work. The function-calling version is more flexible — the agent decides relevance. The pre-injection version reduces round-trips and keeps total latency lower for high-traffic deployments.
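One practical concern with pre-injection is prompt size: retrieved memories are sent with every request, so it can be worth capping them. A minimal sketch, assuming memories arrive as plain strings ordered best-first; build_instructions and max_chars are hypothetical, and a character budget is only a rough proxy for tokens.

```python
BASE_PROMPT = "You are a helpful assistant with persistent memory."

def build_instructions(memories: list[str], max_chars: int = 600) -> str:
    """Append memories (best first) until the character budget is used up."""
    kept: list[str] = []
    used = 0
    for mem in memories:
        line = f"- {mem}"
        if used + len(line) > max_chars:
            break  # stop at the first memory that would overflow the budget
        kept.append(line)
        used += len(line)
    if not kept:
        return BASE_PROMPT
    context = "\n".join(kept)
    return f"{BASE_PROMPT}\n\nRelevant context from this user's memory:\n{context}"

memories = [
    "User prefers bullet-point answers.",
    "User is debugging a KeyError in ingest.py.",
    "x" * 1000,  # oversized entry that does not fit the budget
]
prompt = build_instructions(memories, max_chars=200)
```

Stopping at the first overflow (rather than skipping ahead to smaller entries) keeps the best-first ordering intact, at the cost of sometimes leaving budget unused.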

Production: combine with OpenAI file search

Feather DB handles agent memory — the dynamic, evolving knowledge that accrues from interactions. OpenAI's built-in file search tool handles static document knowledge — product manuals, codebases, knowledge bases that don't change turn by turn. The two are complementary, not competing.

from openai import OpenAI
from agents import Agent, FileSearchTool, Runner, function_tool

client = OpenAI()

# Upload static documents to OpenAI for file search
vector_store = client.vector_stores.create(name="product-docs")
# ... upload your PDFs, markdown files, etc.

search_tool, add_tool = make_memory_tools("user_42")

production_agent = Agent(
    name="Production Assistant",
    instructions="""You are a helpful assistant with two knowledge sources:

1. File search (built-in): use for product documentation, technical specs, policies.
2. search_memory_tool: use for this specific user's history, preferences, and past interactions.

On every turn:
- Call search_memory_tool first for user-specific context.
- Use file search when the question requires authoritative product knowledge.
- After responding, call add_memory_tool to persist anything new about this user.""",
    tools=[
        search_tool,
        add_tool,
        # OpenAI file search is a hosted tool in the Agents SDK, not a function_tool
        FileSearchTool(vector_store_ids=[vector_store.id]),
    ],
    model="gpt-4o"
)

Document knowledge lives in OpenAI's infrastructure. User memory lives in Feather DB — local, fast, and owned by you. Neither competes for the other's role.

Performance in production

Operation	Latency	Notes
Cold start (v0.16.0, 50k vectors)	48ms	`FEATHER_LOAD_THREADS=8`
ANN search p50	0.13ms	500k vectors, k=10
ANN search p99	0.19ms	500k vectors, k=10
`add_batch()` 50k vectors	~10s	3.4× over sequential loop
Sequential `add()`	~2–5ms/call	includes embed round-trip

The 48ms cold start means memory is fully loaded before your first openai.chat.completions.create() call returns. In any I/O-dominated agent loop, Feather DB is not your latency bottleneck.
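Numbers like these are worth re-measuring on your own hardware. Below is a minimal, hedged sketch of a latency harness: measure_latency and percentile are hypothetical helpers, and the timed lambda is a stand-in for a real db.search call. It uses a nearest-rank percentile, under which p99 is always at least p50 by construction.

```python
import math
import time
from typing import Callable

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def measure_latency(fn: Callable[[], object], runs: int = 200) -> dict[str, float]:
    """Time `fn` repeatedly and report p50/p99 in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return {"p50": percentile(samples, 50), "p99": percentile(samples, 99)}

# Stand-in workload; swap in something like lambda: db.search(vec, k=10)
stats = measure_latency(lambda: sum(range(1000)))
```

Running the harness against a warmed-up database and a realistic query mix gives figures that can be compared directly with the table above.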

What you have

Persistent memory across runs — the agent remembers what it learned last session, last month, from the history import.
Per-user namespace isolation — one .feather file, zero cross-contamination between users.
Adaptive decay — preferences persist for six months (half_life=180); session facts fade in a week (half_life=7).
Fast history import — add_batch() loads existing chat logs 3.4× faster than a sequential loop.
48ms cold start — memory ready before the first API call completes.
Composable with OpenAI file search — document knowledge and agent memory as separate, non-competing layers.

Install: pip install feather-db openai openai-agents · GitHub: github.com/feather-store/feather