# Feather DB v0.15: In-RAM int8 Quantization — 1.7× Less Memory, Near-Identical Recall > Feather DB v0.15.0 ships in-RAM int8 quantization via set_int8_ram(). At 60k × 768-dim, RAM drops from 227 MB to 129 MB with recall@10 staying at ~0.88. Here's what changed and when to use it. - **Category**: Architecture - **Read time**: 6 min read - **Date**: June 16, 2026 - **Author**: Feather DB (Engineering) - **URL**: https://getfeather.store/theory/feather-db-v015-int8-quantization --- ## What shipped in v0.15.0 Feather DB v0.15.0 introduces **in-RAM int8 quantization** via a single call: `db.set_int8_ram(modality, max_abs)`. After loading a `.feather` file, you can quantize its vectors in memory — no re-ingest, no format change, no infrastructure change. The numbers at 60k × 768-dim float32: ModeRAMRecall@10 float32 (default)227 MB0.972 int8 in-RAM129 MB~0.88 That's a **1.76× RAM reduction** at a **~0.09 recall cost**. For most AI memory workloads — where you're retrieving 5–10 context chunks, not ranking a million candidates — 0.88 recall@10 is completely acceptable. ## How it works Feather DB stores vectors as float32 in both file and memory by default. File format v7 (v0.12.0) introduced on-disk int8 quantization — ~2.5–4× smaller files. But on-disk quantization doesn't help RAM usage if you load and dequantize to float32 at startup. File format v8, shipped with v0.15.0, adds a per-vector scaling factor stored alongside int8 bytes. When `set_int8_ram()` is called, Feather quantizes the in-memory HNSW graph in-place, keeping vectors as `int8` with a per-vector `max_abs` scale. The L2 distance kernel was updated to handle the scale factor during ANN search — no precision loss from the quantization math is introduced beyond the inherent int8 rounding. ```python import feather_db as fdb db = fdb.DB.open("large.feather", dim=768) # Quantize the "text" modality in RAM after load db.set_int8_ram("text", max_abs=1.0) # All subsequent searches use int8 — same API results = db.search(query_vec, k=10) ``` Backward compatibility: **v3–v7 files load transparently**. You don't need to re-export or migrate existing files to use in-RAM quantization. The quantization happens in memory after `open()`. ## When to use it **Use int8 RAM quantization when:** - You're running Feather on a memory-constrained host (VPS with 1–2 GB RAM, Raspberry Pi, edge device) - Your index is large (50k+ vectors at 768-dim) - Your workload is context retrieval, not precision ranking — recall@10 = 0.88 is fine for surfacing 5–10 memory chunks - You want to run multiple `.feather` files concurrently on one machine **Stick with float32 when:** - You're running precision benchmarks (like LongMemEval where +0.09 recall may matter) - RAM isn't a constraint and you have fewer than 20k vectors - You need exact recall numbers for SLA commitments ## The file format v8 story Feather has shipped four quantization layers across versions: - **v0.12.0 / format v7**: On-disk int8. Smaller files (2.5–4×), but float32 in RAM after load. - **v0.15.0 / format v8**: In-RAM int8. Smaller memory footprint after `set_int8_ram()`. Both can coexist — you can have on-disk int8 *and* in-RAM int8 simultaneously for maximum density. Format v8 files are not backward-compatible with pre-v0.15 builds. If you save after calling `set_int8_ram()`, the resulting file is v8. If you need to share files with older Feather installs, skip the save step and keep the quantization ephemeral (in-RAM only). ## Combining with parallel load Phase 8 also shipped **parallel HNSW load** — `FEATHER_LOAD_THREADS=8` reduces cold-start time from 7.6s to 1.7s for a 40k × 128-dim index (4.7×). Combined with int8 quantization, the startup pattern for a memory-constrained server looks like: ```python import os, feather_db as fdb os.environ["FEATHER_LOAD_THREADS"] = "8" db = fdb.DB.open("persona.feather", dim=768) # 4.7x faster load db.set_int8_ram("text", max_abs=1.0) # 1.7x less RAM print(f"Ready. Vectors: {db.count()}") ``` On a 4-core VPS with 2 GB RAM, this pattern is what enables Feather to run alongside the embedding model process without OOM-killing either. ## What's next The natural next step is **int8 search kernels** — computing L2 distance directly in int8 arithmetic via SIMD, which would cut compute per comparison by ~2× on top of the memory savings. That work is tracked in the Feather repo. For now, int8 vectors are dequantized per-comparison during HNSW traversal — still faster than float32 HNSW on RAM-constrained hosts due to improved cache locality. v0.15.1 also shipped real embedders via `--embed-provider` in `feather-serve`. The two features compose: run `feather-serve` with Gemini embeddings, quantize in RAM after load, and you get semantic persona recall at minimum memory cost. **Install:** `pip install feather-db==0.15.1` **GitHub:** [github.com/feather-store/feather](https://github.com/feather-store/feather) --- *This is the machine-readable mirror of the theory post at [getfeather.store/theory/feather-db-v015-int8-quantization](https://getfeather.store/theory/feather-db-v015-int8-quantization). For the full Feather DB documentation, see [getfeather.store/llms-full.txt](https://getfeather.store/llms-full.txt).*