# In-RAM Int8 Quantization in Feather DB: 1.7× Less Memory

> Float32 vectors are convenient. They're also expensive — four bytes per dimension, every vector, sitting in RAM. Feather DB's int8 quantization cuts that by 4× at the raw level and delivers 1.7× real-world memory reduction once HNSW graph overhead is factored in. Here's how it works and when to reach for it.

- **Category**: Architecture
- **Read time**: 6 min read
- **Date**: June 16, 2026
- **Author**: Feather DB (Engineering)
- **URL**: https://getfeather.store/theory/feather-db-int8-quantization-memory-savings

---

## The memory math behind float32

A 768-dimensional float32 vector takes 3,072 bytes, exactly 3 KiB. At 60,000 vectors, that's about 184 MB of raw vector data before you add a single byte of HNSW graph structure. At 500K vectors, you're past 1.5 GB for vectors alone.

The HNSW graph makes this worse. With M=16, each node carries up to 16 neighbor pointers per level. That overhead adds roughly 25 MB per 100K vectors, on top of the vectors themselves. In Feather's benchmarks, a 60K-vector index at float32 lands at around 227 MB in practice.
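The raw-vector arithmetic is easy to check in a few lines of Python. This is a back-of-the-envelope sketch: `raw_vector_bytes` is a helper name made up for illustration, and it counts only the vector payload, not graph structure, metadata, or allocator overhead.

```python
def raw_vector_bytes(n_vectors: int, dim: int, bytes_per_component: int) -> int:
    """Bytes needed for the vector payload alone (no graph, no metadata)."""
    return n_vectors * dim * bytes_per_component

FLOAT32 = 4  # bytes per component

per_vector = raw_vector_bytes(1, 768, FLOAT32)
at_60k = raw_vector_bytes(60_000, 768, FLOAT32)
at_500k = raw_vector_bytes(500_000, 768, FLOAT32)

# Report in decimal units (1 MB = 10**6 bytes, 1 GB = 10**9 bytes).
print(f"one vector:   {per_vector} bytes")
print(f"60K vectors:  {at_60k / 1e6:.1f} MB")
print(f"500K vectors: {at_500k / 1e9:.2f} GB")
```

Swapping the last argument to `1` gives the int8 payload for the same corpus, which is where the theoretical 4× comes from.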

For an embedded database that's supposed to run alongside an embedding model process on a 2 GB VPS, that's tight.

## What int8 quantization does

Int8 quantization converts each float32 value in a vector to an 8-bit signed integer. The conversion is linear: you record the maximum absolute value of the vector (`max_abs`), then scale every component to fit in the range [-127, 127].

One float32 = 4 bytes. One int8 = 1 byte. The theoretical compression ratio is 4×.
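As a rough illustration of the linear mapping described above, here is a minimal standard-library sketch. The function names are hypothetical, and details such as round-to-nearest and the handling of an all-zero vector are assumptions; Feather's internal implementation may differ.

```python
from array import array

def quantize_int8(vec):
    """Symmetric linear quantization: returns (int8 codes, max_abs)."""
    max_abs = max(abs(x) for x in vec)
    if max_abs == 0.0:
        # All-zero vector: nothing to scale, and we must not divide by zero.
        return array("b", [0] * len(vec)), 0.0
    # Map [-max_abs, max_abs] onto [-127, 127] and round to the nearest integer.
    codes = array("b", (round(x / max_abs * 127) for x in vec))
    return codes, max_abs

def dequantize_int8(codes, max_abs):
    """Map int8 codes back to approximate float values."""
    return [c * max_abs / 127 for c in codes]

vec = [0.5, -1.0, 0.25, 0.0]
codes, max_abs = quantize_int8(vec)
approx = dequantize_int8(codes, max_abs)

print(list(codes), max_abs)
print(approx)
print(array("f", vec).itemsize, "bytes per float32 component")
print(codes.itemsize, "byte per int8 component")
```

Each reconstructed component is within half a quantization step (`max_abs / 254`) of the original, which is the rounding error the rest of this post refers to.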

In practice, you don't get 4× memory reduction at the database level. The HNSW graph structure — neighbor lists, level assignments, the entry point — stores integer node IDs, not vectors. That overhead is the same whether your vectors are float32 or int8. At a 60K-vector index with M=16 HNSW, the graph structure accounts for around 15 MB, which doesn't compress. The net result: 227 MB (float32) down to 129 MB (int8) — a **1.76× reduction**.
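The dilution effect can be sketched with a simple model: memory that shrinks 4× (the vector payload) plus memory that doesn't shrink at all. The helper name and the inputs in the loop are illustrative only; the model doesn't try to reproduce the measured 227 MB and 129 MB totals, which reflect everything the process holds in RAM.

```python
def net_reduction(vector_mb: float, fixed_mb: float, vector_ratio: float = 4.0) -> float:
    """Overall memory reduction when only the vector payload shrinks."""
    before = vector_mb + fixed_mb
    after = vector_mb / vector_ratio + fixed_mb
    return before / after

# Illustrative inputs: 100 MB of float32 vectors plus a growing fixed part.
for fixed_mb in (0, 25, 50, 100):
    print(f"fixed={fixed_mb:>3} MB -> {net_reduction(100, fixed_mb):.2f}x")

# The ratio reported in this post, from its two measured totals.
reported = 227 / 129
print(f"reported: {reported:.2f}x")
```

With no fixed overhead the model returns the full 4×; the more non-compressible memory there is, the further the net figure falls.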

That's the real-world number. Not 4×. But 1.76× on a memory-constrained host is the difference between OOM and stable operation.

## Feather DB's int8 implementation

Feather DB exposes int8 quantization as a first-class modality parameter. When you define a modality with `type="int8"`, Feather stores and searches those vectors in quantized form in RAM — no extra API call required, no post-load step.

```python
import numpy as np
import feather_db as fdb

db = fdb.DB.open("memory.feather", dim=768)

# Register an int8 modality; quantization is automatic
db.add_modality("text", type="int8")

# Placeholder vectors for illustration; in practice these come from your embedding model
embedding = np.random.rand(768).astype(np.float32)
query_vec = np.random.rand(768).astype(np.float32)

# Add vectors as float32; Feather quantizes internally
db.add(vector=embedding, text="session summary from earlier today", modality="text")

# Search API is identical; no change at call sites
results = db.search(query_vec, k=10, modality="text")
```

The float32-to-int8 conversion happens at write time. You pass in a float32 embedding from your embedding model; Feather computes the scale factor, converts the vector to int8, and stores it. At search time, the HNSW traversal computes distances against int8 vectors — dequantizing per-comparison during graph traversal.
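To make "dequantizing per-comparison" concrete, here is a small sketch of an L2 distance that converts int8 codes back to floats on the fly. The function name is hypothetical, and treating both operands as quantized is an assumption; the post doesn't say whether the query vector is kept in float32 during traversal.

```python
import math

def l2_dequantized(codes_a, max_abs_a, codes_b, max_abs_b):
    """L2 distance between two int8-coded vectors, dequantizing on the fly."""
    step_a = max_abs_a / 127  # float value of one int8 step for vector a
    step_b = max_abs_b / 127  # float value of one int8 step for vector b
    total = 0.0
    for ca, cb in zip(codes_a, codes_b):
        diff = ca * step_a - cb * step_b
        total += diff * diff
    return math.sqrt(total)

# Two tiny hand-quantized vectors with different per-vector scales.
a_codes, a_max = [127, 0, -64], 1.0
b_codes, b_max = [127, 127, 0], 0.5
dist = l2_dequantized(a_codes, a_max, b_codes, b_max)
print(f"{dist:.4f}")
```

Only the int8 codes and one scale per vector live in memory; the float values exist just long enough to compute the distance.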

The `max_abs` scale factor is stored per-vector, not per-index. This matters because embedding model outputs aren't uniformly distributed — some vectors have components near ±1.0, others near ±0.1. Per-vector scaling preserves more information than a global scale would.
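A quick experiment shows why the choice of scale matters. The sketch below quantizes a small-magnitude vector twice, once with a scale shared across all vectors and once with its own `max_abs`, and compares the worst-case reconstruction error. The data is synthetic and the helper name is illustrative.

```python
import random

def max_roundtrip_error(vec, max_abs):
    """Largest absolute error after quantizing to int8 with the given scale."""
    step = max_abs / 127
    return max(abs(x - round(x / step) * step) for x in vec)

random.seed(0)
large = [random.uniform(-1.0, 1.0) for _ in range(768)]  # components up to ±1.0
small = [random.uniform(-0.1, 0.1) for _ in range(768)]  # components up to ±0.1

global_max = max(abs(x) for x in large + small)  # one scale for every vector
small_max = max(abs(x) for x in small)           # scale for this vector only

err_global = max_roundtrip_error(small, global_max)
err_per_vector = max_roundtrip_error(small, small_max)
print(f"global scale:     {err_global:.5f}")
print(f"per-vector scale: {err_per_vector:.5f}")
```

Under the global scale the small vector uses only a narrow slice of the 255 available levels, so its quantization step, and therefore its error, is about ten times larger.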

## v0.15.0: int8 HNSW graphs persist and round-trip

The first question anyone asks about in-RAM quantization: what happens when I save and reload?

In v0.15.0, the answer is: it round-trips exactly. When you save a `.feather` file containing int8 modalities, the HNSW graph for that modality is persisted in its int8 form — scale factors and all. On reload, the graph is reconstructed from the persisted int8 data without re-quantizing from float32.

This is verified in Feather's test suite: a save/reload cycle on an int8 modality produces byte-identical search results. No drift, no re-quantization error accumulation, no silent float32 fallback.

The file format v8 addition that makes this work is a per-vector scale factor section — a compact float32 array indexed by node position, stored alongside the int8 vectors. Readers that predate v8 skip this section gracefully. v8 readers that open v7 files see no int8 section and fall back to float32 search transparently.
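The reason a persisted int8 representation round-trips without drift can be shown with a toy serializer. The byte layout here is hypothetical and is not Feather's v8 format (which, as described above, keeps scale factors in their own section); the point is only that reloading stored codes and scales involves no second rounding step.

```python
import struct
from array import array

def pack_vector(codes: array, scale: float) -> bytes:
    """Serialize one quantized vector: float32 scale, then raw int8 codes."""
    return struct.pack("<f", scale) + codes.tobytes()

def unpack_vector(blob: bytes):
    """Inverse of pack_vector: returns (int8 codes, scale)."""
    (scale,) = struct.unpack_from("<f", blob, 0)
    codes = array("b")
    codes.frombytes(blob[4:])
    return codes, scale

codes = array("b", [64, -127, 32, 0])
blob = pack_vector(codes, 0.75)
codes_back, scale_back = unpack_vector(blob)

# Re-serializing the reloaded data yields the same bytes: nothing was re-quantized.
print(pack_vector(codes_back, scale_back) == blob)
```

Contrast this with reloading float32 vectors and quantizing again, where any change in the scale computation or rounding mode could shift codes between runs.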

## v0.16.0: int8 modalities get their own persisted int8 graph

v0.15.0 quantized vectors but shared the HNSW graph structure with float32 modalities. v0.16.0 goes further: when a modality is declared as int8, it gets its own dedicated HNSW graph — built, stored, and loaded entirely in int8 form.

Why this matters: a shared graph built on float32 neighbor relationships isn't optimal for int8 distance computation. The neighbor lists that minimize float32 L2 distance aren't exactly the same as those that minimize int8 L2 distance. A dedicated int8 graph is built using int8 distance from the start, which produces better recall at the same graph size.

The practical impact on load time: because the persisted int8 graph is loaded directly without any float32 intermediate, startup time for int8 modalities stays consistent with float32 modality load times. The "fast load maintained" property holds: a 40K-vector int8 modality using `FEATHER_LOAD_THREADS=8` loads in roughly the same wall time as its float32 equivalent.

## Recall impact: the honest number

Int8 quantization introduces rounding error. Each float32 component gets mapped to one of 255 discrete values. The L2 distance computed between two quantized vectors is an approximation of the true float32 L2 distance.

In Feather's benchmarks at 60K × 768-dim:

| Mode | RAM | Recall@10 |
|---|---|---|
| float32 (default) | 227 MB | 0.972 |
| int8 modality | 129 MB | ~0.965 |

That's roughly a 0.5–1 percentage point recall reduction. For most AI memory workloads — surfacing 5–10 context chunks from a session history — a recall@10 of 0.965 is indistinguishable from 0.972 in production. You'd need a careful A/B test to see a difference in answer quality.
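If you want to measure the gap on your own data, recall@k is straightforward to compute once you have brute-force ground truth. The sketch below uses the standard definition; the post doesn't describe Feather's benchmark harness, so the function and the sample IDs are illustrative.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k neighbors that the approximate search returned."""
    exact_top = set(exact_ids[:k])
    hits = sum(1 for i in approx_ids[:k] if i in exact_top)
    return hits / len(exact_top)

exact = [3, 7, 1, 9, 4, 8, 2, 6, 5, 0]    # ground-truth top 10 from brute-force search
approx = [3, 7, 1, 9, 4, 8, 2, 6, 5, 42]  # index result: one true neighbor missed
print(recall_at_k(approx, exact))
```

Averaging this value over a few hundred held-out queries, once against a float32 modality and once against an int8 one, gives a like-for-like comparison for your own embeddings.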

Where the recall gap matters: precision benchmarks like LongMemEval, where every percentage point of recall translates to downstream answer accuracy. For those workloads, stick with float32.

## SIMD and int8: what's dispatched today

Feather's SSE/AVX kernels on x86 are dispatched at runtime based on CPUID. The current SIMD kernels operate on float32 — so during HNSW traversal over an int8 modality, vectors are dequantized to float32 before the L2 computation, and the AVX kernel handles the float32 distance.

This means you still get the SIMD speedup on x86 (AVX2: 8 floats/op; AVX-512: 16 floats/op). You also get an additional benefit that isn't in the kernel itself: **cache locality**. Int8 vectors are 4× smaller than float32 vectors, which means more candidate vectors fit in L1/L2 cache during graph traversal. On memory-constrained hardware, this reduces cache miss frequency during HNSW search — a compound benefit on top of the lower RAM footprint.
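The cache-locality argument comes down to how many candidate vectors fit in a given cache. The sketch below is an idealized upper bound: the cache sizes are assumed typical values that vary by CPU, and real caches also hold graph links, code, and other data.

```python
def vectors_per_cache(cache_bytes: int, dim: int, bytes_per_component: int,
                      per_vector_overhead: int = 0) -> int:
    """How many whole vectors fit in a cache of the given size."""
    return cache_bytes // (dim * bytes_per_component + per_vector_overhead)

L1_BYTES = 32 * 1024    # assumed L1 data cache size; varies by CPU
L2_BYTES = 1024 * 1024  # assumed L2 cache size; varies by CPU

for name, size in (("L1", L1_BYTES), ("L2", L2_BYTES)):
    f32 = vectors_per_cache(size, 768, 4)
    # int8 codes plus a 4-byte float32 scale factor per vector
    i8 = vectors_per_cache(size, 768, 1, per_vector_overhead=4)
    print(f"{name}: {f32} float32 vectors vs {i8} int8 vectors")
```

Even with the per-vector scale factor included, roughly four times as many int8 candidates fit at each cache level.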

Native int8 SIMD distance kernels — computing L2 directly in int8 arithmetic without float32 dequantization — are on the Feather roadmap. That would compound the memory savings and the compute savings. For now, the cache locality effect partially captures the benefit.
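One way such a kernel could be structured, offered as a sketch rather than a description of Feather's roadmap design: with per-vector scales, the squared L2 distance expands into three integer dot products plus a handful of float multiplies. The function names are hypothetical, and plain Python loops stand in for what would be SIMD instructions.

```python
def l2_sq_int8(codes_a, max_abs_a, codes_b, max_abs_b):
    """Squared L2 distance from integer dot products and per-vector scales."""
    # Integer-only accumulations: the loops an int8 SIMD kernel would vectorize.
    aa = sum(ca * ca for ca in codes_a)
    bb = sum(cb * cb for cb in codes_b)
    ab = sum(ca * cb for ca, cb in zip(codes_a, codes_b))
    sa = max_abs_a / 127
    sb = max_abs_b / 127
    # ||sa*a - sb*b||^2 = sa^2 * (a.a) + sb^2 * (b.b) - 2 * sa * sb * (a.b)
    return sa * sa * aa + sb * sb * bb - 2.0 * sa * sb * ab

def l2_sq_dequantized(codes_a, max_abs_a, codes_b, max_abs_b):
    """Reference: dequantize each component to float, then accumulate."""
    sa, sb = max_abs_a / 127, max_abs_b / 127
    return sum((ca * sa - cb * sb) ** 2 for ca, cb in zip(codes_a, codes_b))

a_codes, a_max = [127, 0, -64, 12], 1.0
b_codes, b_max = [127, 127, 0, -90], 0.5
fast = l2_sq_int8(a_codes, a_max, b_codes, b_max)
ref = l2_sq_dequantized(a_codes, a_max, b_codes, b_max)
print(f"{fast:.6f} vs {ref:.6f}")
```

The self dot products (`aa`, `bb`) depend on a single vector each, so they could be computed once at write time, leaving one integer dot product per comparison.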

## When to use int8

**Reach for `type="int8"` when:**

- You're running Feather on a memory-constrained host: a 1–2 GB VPS, a Raspberry Pi, an edge device, or a container with a strict memory limit

- Your index is large — 50K+ vectors at 768-dim, where float32 RAM usage starts pressing against available memory

- You're running multiple `.feather` files concurrently on one machine (e.g., per-user memory stores for a multi-tenant agent service)

- Your workload is context retrieval, not precision ranking — a 0.965 recall@10 is fine for surfacing memory chunks

**Stick with float32 when:**

- RAM isn't the constraint and your corpus is under 20K vectors

- You're running LongMemEval-style precision benchmarks where every recall point matters

- You have SLA commitments on exact recall numbers

## Maximum memory efficiency: int8 + adaptive index capacity

Int8 modalities combine cleanly with Feather's adaptive index capacity feature, which pre-allocates HNSW graph memory based on observed growth rate rather than a fixed `max_elements` cap. The combination matters because HNSW graph memory scales with the number of nodes — over-allocating `max_elements` wastes RAM even before you add vectors.

With adaptive capacity, the index grows in increments tuned to your actual ingest rate. With int8, each vector takes 1 byte per dimension instead of 4. The two features are multiplicative: a 768-dim index that would have peaked at 800 MB with float32 and static over-allocation can run within 200 MB with int8 modality and adaptive capacity enabled.

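```python
import os

import numpy as np
import feather_db as fdb

# Parallel load for fast startup
os.environ["FEATHER_LOAD_THREADS"] = "8"

db = fdb.DB.open("memory.feather", dim=768)

# Int8 modality with adaptive capacity
db.add_modality("text", type="int8", adaptive_capacity=True)

# Placeholder vectors for illustration; in practice these come from your embedding model
embedding = np.random.rand(768).astype(np.float32)
query_vec = np.random.rand(768).astype(np.float32)

# Add vectors: quantized at write time, adaptive graph sizing
db.add(vector=embedding, text="context chunk", modality="text")

# Search: same API, int8 under the hood
results = db.search(query_vec, k=10, modality="text")
if results:  # guard against an empty result set
    print(f"Top result: {results[0].text}")
```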

On a 4-core VPS with 2 GB RAM running a 60K-vector agent memory store alongside an Ollama embedding process, this pattern is what keeps both processes alive. Float32 + static allocation would OOM. Int8 + adaptive capacity fits.

## The upgrade path

If you have an existing float32 index and want to switch to int8, there's no re-ingest required. The migration path:

- Open the existing `.feather` file as usual

- Call `db.add_modality("text", type="int8")` to register the int8 modality

- Use `db.set_int8_ram("text")` to quantize the in-memory vectors immediately

- Save — the resulting v8 file persists the int8 graph for fast future loads

New vectors added after the modality switch are quantized at write time. Vectors added before the switch are quantized when `set_int8_ram()` is called. The HNSW graph is rebuilt in int8 form on the next load (v0.16.0 behavior), so the first save-and-reload after migration produces the dedicated int8 graph.

**Install:** `pip install feather-db` · **GitHub:** [github.com/feather-store/feather](https://github.com/feather-store/feather)

---

*This is the machine-readable mirror of the theory post at [getfeather.store/theory/feather-db-int8-quantization-memory-savings](https://getfeather.store/theory/feather-db-int8-quantization-memory-savings). For the full Feather DB documentation, see [getfeather.store/llms-full.txt](https://getfeather.store/llms-full.txt).*