Feather DB v0.15: In-RAM int8 Quantization — 1.7× Less Memory, Near-Identical Recall
Feather DB v0.15.0 ships in-RAM int8 quantization via set_int8_ram(). At 60k × 768-dim, RAM drops from 227 MB to 129 MB with recall@10 staying at ~0.88. Here's what changed and when to use it.
What shipped in v0.15.0
Feather DB v0.15.0 introduces in-RAM int8 quantization via a single call: db.set_int8_ram(modality, max_abs). After loading a .feather file, you can quantize its vectors in memory — no re-ingest, no format change, no infrastructure change.
The numbers at 60k × 768-dim float32:
| Mode | RAM | Recall@10 |
|---|---|---|
| float32 (default) | 227 MB | 0.972 |
| int8 in-RAM | 129 MB | ~0.88 |
That's a 1.76× RAM reduction at a ~0.09 recall cost. For most AI memory workloads — where you're retrieving 5–10 context chunks, not ranking a million candidates — 0.88 recall@10 is completely acceptable.
How it works
Feather DB stores vectors as float32 in both file and memory by default. File format v7 (v0.12.0) introduced on-disk int8 quantization — ~2.5–4× smaller files. But on-disk quantization doesn't help RAM usage if you load and dequantize to float32 at startup.
File format v8, shipped with v0.15.0, adds a per-vector scaling factor stored alongside int8 bytes. When set_int8_ram() is called, Feather quantizes the in-memory HNSW graph in-place, keeping vectors as int8 with a per-vector max_abs scale. The L2 distance kernel was updated to handle the scale factor during ANN search — no precision loss from the quantization math is introduced beyond the inherent int8 rounding.
import feather_db as fdb
db = fdb.DB.open("large.feather", dim=768)
# Quantize the "text" modality in RAM after load
db.set_int8_ram("text", max_abs=1.0)
# All subsequent searches use int8 — same API
results = db.search(query_vec, k=10)
Backward compatibility: v3–v7 files load transparently. You don't need to re-export or migrate existing files to use in-RAM quantization. The quantization happens in memory after open().
When to use it
Use int8 RAM quantization when:
- You're running Feather on a memory-constrained host (VPS with 1–2 GB RAM, Raspberry Pi, edge device)
- Your index is large (50k+ vectors at 768-dim)
- Your workload is context retrieval, not precision ranking — recall@10 = 0.88 is fine for surfacing 5–10 memory chunks
- You want to run multiple
.featherfiles concurrently on one machine
Stick with float32 when:
- You're running precision benchmarks (like LongMemEval where +0.09 recall may matter)
- RAM isn't a constraint and you have fewer than 20k vectors
- You need exact recall numbers for SLA commitments
The file format v8 story
Feather has shipped four quantization layers across versions:
- v0.12.0 / format v7: On-disk int8. Smaller files (2.5–4×), but float32 in RAM after load.
- v0.15.0 / format v8: In-RAM int8. Smaller memory footprint after
set_int8_ram(). Both can coexist — you can have on-disk int8 and in-RAM int8 simultaneously for maximum density.
Format v8 files are not backward-compatible with pre-v0.15 builds. If you save after calling set_int8_ram(), the resulting file is v8. If you need to share files with older Feather installs, skip the save step and keep the quantization ephemeral (in-RAM only).
Combining with parallel load
Phase 8 also shipped parallel HNSW load — FEATHER_LOAD_THREADS=8 reduces cold-start time from 7.6s to 1.7s for a 40k × 128-dim index (4.7×). Combined with int8 quantization, the startup pattern for a memory-constrained server looks like:
import os, feather_db as fdb
os.environ["FEATHER_LOAD_THREADS"] = "8"
db = fdb.DB.open("persona.feather", dim=768) # 4.7x faster load
db.set_int8_ram("text", max_abs=1.0) # 1.7x less RAM
print(f"Ready. Vectors: {db.count()}")
On a 4-core VPS with 2 GB RAM, this pattern is what enables Feather to run alongside the embedding model process without OOM-killing either.
What's next
The natural next step is int8 search kernels — computing L2 distance directly in int8 arithmetic via SIMD, which would cut compute per comparison by ~2× on top of the memory savings. That work is tracked in the Feather repo. For now, int8 vectors are dequantized per-comparison during HNSW traversal — still faster than float32 HNSW on RAM-constrained hosts due to improved cache locality.
v0.15.1 also shipped real embedders via --embed-provider in feather-serve. The two features compose: run feather-serve with Gemini embeddings, quantize in RAM after load, and you get semantic persona recall at minimum memory cost.
Install: pip install feather-db==0.15.1
GitHub: github.com/feather-store/feather