# Feather DB v0.15: In-RAM int8 Quantization — 1.7× Less Memory, Near-Identical Recall

> Feather DB v0.15.0 ships in-RAM int8 quantization via set_int8_ram(). At 60k × 768-dim, RAM drops from 227 MB to 129 MB with recall@10 staying at ~0.88. Here's what changed and when to use it.

- **Category**: Architecture
- **Read time**: 6 min read
- **Date**: June 16, 2026
- **Author**: Feather DB (Engineering)
- **URL**: https://getfeather.store/theory/feather-db-v015-int8-quantization

---

## What shipped in v0.15.0

Feather DB v0.15.0 introduces **in-RAM int8 quantization** via a single call: `db.set_int8_ram(modality, max_abs)`. After loading a `.feather` file, you can quantize its vectors in memory — no re-ingest, no format change, no infrastructure change.

The numbers at 60k × 768-dim float32:

ModeRAMRecall@10

float32 (default)227 MB0.972
int8 in-RAM129 MB~0.88

That's a **1.76× RAM reduction** at a **~0.09 recall cost**. For most AI memory workloads — where you're retrieving 5–10 context chunks, not ranking a million candidates — 0.88 recall@10 is completely acceptable.

## How it works

Feather DB stores vectors as float32 in both file and memory by default. File format v7 (v0.12.0) introduced on-disk int8 quantization — ~2.5–4× smaller files. But on-disk quantization doesn't help RAM usage if you load and dequantize to float32 at startup.

File format v8, shipped with v0.15.0, adds a per-vector scaling factor stored alongside int8 bytes. When `set_int8_ram()` is called, Feather quantizes the in-memory HNSW graph in-place, keeping vectors as `int8` with a per-vector `max_abs` scale. The L2 distance kernel was updated to handle the scale factor during ANN search — no precision loss from the quantization math is introduced beyond the inherent int8 rounding.

```python
import feather_db as fdb

db = fdb.DB.open("large.feather", dim=768)

# Quantize the "text" modality in RAM after load
db.set_int8_ram("text", max_abs=1.0)

# All subsequent searches use int8 — same API
results = db.search(query_vec, k=10)

```

Backward compatibility: **v3–v7 files load transparently**. You don't need to re-export or migrate existing files to use in-RAM quantization. The quantization happens in memory after `open()`.

## When to use it

**Use int8 RAM quantization when:**

- You're running Feather on a memory-constrained host (VPS with 1–2 GB RAM, Raspberry Pi, edge device)

- Your index is large (50k+ vectors at 768-dim)

- Your workload is context retrieval, not precision ranking — recall@10 = 0.88 is fine for surfacing 5–10 memory chunks

- You want to run multiple `.feather` files concurrently on one machine

**Stick with float32 when:**

- You're running precision benchmarks (like LongMemEval where +0.09 recall may matter)

- RAM isn't a constraint and you have fewer than 20k vectors

- You need exact recall numbers for SLA commitments

## The file format v8 story

Feather has shipped four quantization layers across versions:

- **v0.12.0 / format v7**: On-disk int8. Smaller files (2.5–4×), but float32 in RAM after load.

- **v0.15.0 / format v8**: In-RAM int8. Smaller memory footprint after `set_int8_ram()`. Both can coexist — you can have on-disk int8 *and* in-RAM int8 simultaneously for maximum density.

Format v8 files are not backward-compatible with pre-v0.15 builds. If you save after calling `set_int8_ram()`, the resulting file is v8. If you need to share files with older Feather installs, skip the save step and keep the quantization ephemeral (in-RAM only).

## Combining with parallel load

Phase 8 also shipped **parallel HNSW load** — `FEATHER_LOAD_THREADS=8` reduces cold-start time from 7.6s to 1.7s for a 40k × 128-dim index (4.7×). Combined with int8 quantization, the startup pattern for a memory-constrained server looks like:

```python
import os, feather_db as fdb

os.environ["FEATHER_LOAD_THREADS"] = "8"
db = fdb.DB.open("persona.feather", dim=768)  # 4.7x faster load
db.set_int8_ram("text", max_abs=1.0)           # 1.7x less RAM

print(f"Ready. Vectors: {db.count()}")

```

On a 4-core VPS with 2 GB RAM, this pattern is what enables Feather to run alongside the embedding model process without OOM-killing either.

## What's next

The natural next step is **int8 search kernels** — computing L2 distance directly in int8 arithmetic via SIMD, which would cut compute per comparison by ~2× on top of the memory savings. That work is tracked in the Feather repo. For now, int8 vectors are dequantized per-comparison during HNSW traversal — still faster than float32 HNSW on RAM-constrained hosts due to improved cache locality.

v0.15.1 also shipped real embedders via `--embed-provider` in `feather-serve`. The two features compose: run `feather-serve` with Gemini embeddings, quantize in RAM after load, and you get semantic persona recall at minimum memory cost.

**Install:** `pip install feather-db==0.15.1`

**GitHub:** [github.com/feather-store/feather](https://github.com/feather-store/feather)

---

*This is the machine-readable mirror of the theory post at [getfeather.store/theory/feather-db-v015-int8-quantization](https://getfeather.store/theory/feather-db-v015-int8-quantization). For the full Feather DB documentation, see [getfeather.store/llms-full.txt](https://getfeather.store/llms-full.txt).*