Feather DB v0.15: In-RAM int8 Quantization — 1.7× Less Memory, Near-Identical Recall

What shipped in v0.15.0

Feather DB v0.15.0 introduces in-RAM int8 quantization via a single call: db.set_int8_ram(modality, max_abs). After loading a .feather file, you can quantize its vectors in memory — no re-ingest, no format change, no infrastructure change.

The numbers at 60k × 768-dim float32:

Mode	RAM	Recall@10
float32 (default)	227 MB	0.972
int8 in-RAM	129 MB	~0.88

That's a 1.76× RAM reduction at a ~0.09 recall cost. For most AI memory workloads — where you're retrieving 5–10 context chunks, not ranking a million candidates — 0.88 recall@10 is completely acceptable.
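
A quick back-of-the-envelope calculation helps put the table in context. The sketch below computes only the raw vector payload for the benchmark shape and compares it with the measured figures. The one-float32-scale-per-vector overhead is an assumption for illustration, and the post does not break down what else contributes to the measured RAM, so treat the gap between payload and measured numbers as unexplained here rather than attributed.

```python
# Raw vector payload for the benchmark shape quoted above.
N_VECTORS = 60_000
DIM = 768
MB = 1_000_000

float32_bytes = N_VECTORS * DIM * 4   # 4 bytes per float32 component
int8_bytes = N_VECTORS * DIM * 1      # 1 byte per int8 component
scale_bytes = N_VECTORS * 4           # assumed: one float32 scale per vector

float32_mb = float32_bytes / MB
int8_mb = (int8_bytes + scale_bytes) / MB

# Measured figures, copied from the table above.
measured_ratio = 227 / 129
recall_cost = 0.972 - 0.88

print(f"raw float32 payload: {float32_mb:.1f} MB")
print(f"raw int8 payload:    {int8_mb:.1f} MB")
print(f"measured RAM ratio:  {measured_ratio:.2f}x")
print(f"recall@10 cost:      {recall_cost:.3f}")
```

The payload alone shrinks by roughly 4×, while the measured process footprint shrinks by 1.76×, which suggests that a meaningful share of RAM is not vector data.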

How it works

Feather DB stores vectors as float32 in both file and memory by default. File format v7 (v0.12.0) introduced on-disk int8 quantization — ~2.5–4× smaller files. But on-disk quantization doesn't help RAM usage if you load and dequantize to float32 at startup.

File format v8, shipped with v0.15.0, adds a per-vector scaling factor stored alongside the int8 bytes. When set_int8_ram() is called, Feather quantizes the vectors held by the in-memory HNSW index in place, keeping each one as int8 with a per-vector max_abs scale. The L2 distance kernel was updated to apply the scale factor during ANN search, so the only precision loss is the inherent int8 rounding.

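import numpy as np
import feather_db as fdb

db = fdb.DB.open("large.feather", dim=768)

# Quantize the "text" modality in RAM after load
db.set_int8_ram("text", max_abs=1.0)

# Placeholder query (assumes search accepts a float32 NumPy array);
# in practice this is a 768-dim embedding from your model
query_vec = np.random.rand(768).astype(np.float32)

# All subsequent searches use int8; the API is unchanged
results = db.search(query_vec, k=10)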
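
To make the quantization step described above concrete, here is a minimal NumPy sketch of symmetric per-vector max-abs int8 quantization with on-the-fly dequantization in the distance computation. It is a generic illustration of the technique, not Feather's implementation: the names quantize_int8 and l2_squared are hypothetical, and mapping the largest component to 127 is an assumption.

```python
import numpy as np

def quantize_int8(vectors: np.ndarray):
    """Symmetric per-vector int8 quantization (illustrative only)."""
    # One scale per row: the largest absolute component maps to 127.
    max_abs = np.abs(vectors).max(axis=1, keepdims=True)
    max_abs = np.where(max_abs == 0.0, 1.0, max_abs)  # guard all-zero rows
    scales = (max_abs / 127.0).astype(np.float32)
    codes = np.clip(np.rint(vectors / scales), -127, 127).astype(np.int8)
    return codes, scales

def l2_squared(query: np.ndarray, codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Squared L2 distance from a float32 query to every quantized vector."""
    approx = codes.astype(np.float32) * scales  # dequantize on the fly
    diff = approx - query
    return np.einsum("ij,ij->i", diff, diff)

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 768)).astype(np.float32)
codes, scales = quantize_int8(vectors)

# Per-component rounding error is bounded by about half a quantization step.
max_err = float(np.abs(codes.astype(np.float32) * scales - vectors).max())
nearest = int(np.argmin(l2_squared(vectors[0], codes, scales)))
print(codes.dtype, codes.nbytes, vectors.nbytes, max_err, nearest)
```

The codes array takes a quarter of the bytes of the float32 input, and the only error introduced is the rounding to the nearest int8 step.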

Backward compatibility: v3–v7 files load transparently. You don't need to re-export or migrate existing files to use in-RAM quantization. The quantization happens in memory after open().

When to use it

Use int8 RAM quantization when:

You're running Feather on a memory-constrained host (VPS with 1–2 GB RAM, Raspberry Pi, edge device)
Your index is large (50k+ vectors at 768-dim)
Your workload is context retrieval, not precision ranking — recall@10 = 0.88 is fine for surfacing 5–10 memory chunks
You want to run multiple .feather files concurrently on one machine

Stick with float32 when:

You're running precision benchmarks (such as LongMemEval, where a 0.09 recall difference may matter)
RAM isn't a constraint and you have fewer than 20k vectors
You need exact recall numbers for SLA commitments
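
Because the decision hinges on recall, it can be worth measuring recall@10 on your own vectors rather than relying on the benchmark figure. The sketch below is one way to do that, under stated assumptions: recall_at_k and exact_top_k are hypothetical helpers, the data is random, and the "approximate" results come from a brute-force search over int8-rounded vectors as a stand-in. It therefore measures quantization error only, not HNSW error, and will not reproduce the 0.88 figure. In practice you would substitute the ids returned by your index for the stand-in.

```python
import numpy as np

def recall_at_k(truth_ids: np.ndarray, result_ids: np.ndarray) -> float:
    """Mean fraction of true top-k neighbours present in the returned top-k.

    Both arrays have shape (n_queries, k) and hold vector ids.
    """
    pairs = zip(truth_ids.tolist(), result_ids.tolist())
    hits = [len(set(t) & set(r)) for t, r in pairs]
    return float(np.mean(hits)) / truth_ids.shape[1]

def exact_top_k(queries: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    """Brute-force L2 top-k, used as ground truth."""
    # ||q - v||^2 = ||q||^2 - 2 q.v + ||v||^2; ||q||^2 is constant per query.
    scores = (vectors ** 2).sum(axis=1)[None, :] - 2.0 * queries @ vectors.T
    return np.argsort(scores, axis=1)[:, :k]

rng = np.random.default_rng(0)
vectors = rng.standard_normal((2000, 64)).astype(np.float32)
queries = rng.standard_normal((50, 64)).astype(np.float32)

truth = exact_top_k(queries, vectors, k=10)

# Stand-in for an approximate index: same search over int8-rounded vectors.
scales = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
approx_vectors = np.rint(vectors / scales) * scales
approx = exact_top_k(queries, approx_vectors, k=10)

measured = recall_at_k(truth, approx)
print(f"recall@10 = {measured:.3f}")
```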

The file format v8 story

Feather has shipped two quantization layers across versions:

v0.12.0 / format v7: On-disk int8. Smaller files (2.5–4×), but float32 in RAM after load.
v0.15.0 / format v8: In-RAM int8. Smaller memory footprint after set_int8_ram().

The two can coexist: you can have on-disk int8 and in-RAM int8 simultaneously for maximum density.

Format v8 files are not backward-compatible with pre-v0.15 builds. If you save after calling set_int8_ram(), the resulting file is v8. If you need to share files with older Feather installs, skip the save step and keep the quantization ephemeral (in-RAM only).

Combining with parallel load

Phase 8 also shipped parallel HNSW load: FEATHER_LOAD_THREADS=8 reduces cold-start time from 7.6s to 1.7s for a 40k × 128-dim index (about 4.5×). Combined with int8 quantization, the startup pattern for a memory-constrained server looks like:

import os

import feather_db as fdb

# Must be set before the index is opened
os.environ["FEATHER_LOAD_THREADS"] = "8"
db = fdb.DB.open("persona.feather", dim=768)  # parallel HNSW load
db.set_int8_ram("text", max_abs=1.0)           # int8 vectors in RAM

print(f"Ready. Vectors: {db.count()}")

On a 4-core VPS with 2 GB RAM, this pattern is what enables Feather to run alongside the embedding model process without OOM-killing either.

What's next

The natural next step is int8 search kernels — computing L2 distance directly in int8 arithmetic via SIMD, which would cut compute per comparison by ~2× on top of the memory savings. That work is tracked in the Feather repo. For now, int8 vectors are dequantized per-comparison during HNSW traversal — still faster than float32 HNSW on RAM-constrained hosts due to improved cache locality.
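
The difference between the two approaches can be sketched in NumPy. With per-vector scales sa and sb, the squared L2 distance expands into three integer dot products, so the scales only need to be applied once at the end. This is a sketch of the general idea, not Feather's current or planned kernel: the function names are hypothetical, and a real kernel would use SIMD intrinsics rather than NumPy.

```python
import numpy as np

def l2sq_dequantize(qa, sa, qb, sb) -> float:
    """Dequantize-per-comparison path: expand to float32, then subtract."""
    diff = qa.astype(np.float32) * sa - qb.astype(np.float32) * sb
    return float(diff @ diff)

def l2sq_integer(qa, sa, qb, sb) -> float:
    """Integer path: three int32 dot products, scales applied at the end.

    ||sa*qa - sb*qb||^2 = sa^2 (qa.qa) - 2 sa sb (qa.qb) + sb^2 (qb.qb)
    """
    a = qa.astype(np.int32)  # widen so products accumulate without overflow
    b = qb.astype(np.int32)
    aa, ab, bb = int(a @ a), int(a @ b), int(b @ b)
    return sa * sa * aa - 2.0 * sa * sb * ab + sb * sb * bb

rng = np.random.default_rng(0)
qa = rng.integers(-127, 128, size=768, dtype=np.int8)
qb = rng.integers(-127, 128, size=768, dtype=np.int8)
sa, sb = 0.01, 0.012  # illustrative per-vector scales

d_float = l2sq_dequantize(qa, sa, qb, sb)
d_int = l2sq_integer(qa, sa, qb, sb)
print(d_float, d_int)
```

The self-products qa.qa and qb.qb depend on a single vector each, so they could be cached per vector, leaving only the cross term qa.qb to compute per comparison. At 768 dimensions the largest possible accumulated value is 768 × 127², which fits comfortably in int32.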

v0.15.1 also shipped real embedders via --embed-provider in feather-serve. The two features compose: run feather-serve with Gemini embeddings, quantize in RAM after load, and you get semantic persona recall at minimum memory cost.

Install: pip install feather-db==0.15.1

GitHub: github.com/feather-store/feather