# Vector Database Quantization Guide: int8, Float16, and Beyond

> Quantization trades a small amount of recall precision for a large reduction in memory. Here is the math, the tradeoffs, and how Feather DB's int8 modality puts it into practice.

- **Category**: Architecture
- **Read time**: 7 min read
- **Date**: June 16, 2026
- **Author**: Feather DB Engineering (Engineering Team)
- **URL**: https://getfeather.store/theory/vector-database-quantization-guide

---

*Architecture Deep Dive · Feather DB v0.15.0 / v0.16.0 · June 2026*

---

## What Quantization Is

Every vector in a database is a list of floating-point numbers. By default, those numbers are stored as 32-bit floats (`float32`). Quantization reduces that precision: to 16-bit floats (`float16`), to 8-bit integers (`int8`), or further still to single bits (binary quantization).

Less precision means fewer bytes per number. Fewer bytes per number means less memory. Less memory means you can fit a larger index in RAM, or the same index on a smaller, cheaper machine.

The cost is a small loss of recall accuracy. For most production workloads, that loss is acceptable. For some it is not. The rest of this guide is about understanding that tradeoff precisely.

---

## The Memory Math

Start with a concrete number. A standard embedding model produces 768-dimensional vectors. Here is what one vector costs at each precision level:

| Precision | Bytes per dimension | Bytes per 768-dim vector | Reduction vs float32 |
| --- | --- | --- | --- |
| float32 | 4 | 3,072 | 1× |
| float16 | 2 | 1,536 | 2× |
| int8 | 1 | 768 | 4× |
| binary | 0.125 | 96 | 32× |

A 1M-vector index at float32 costs roughly 3 GB of raw vector data. At int8, that drops to ~768 MB. On paper, a 4× raw reduction.
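The arithmetic behind the table is easy to reproduce. The sketch below is plain Python with no Feather DB dependency; the helper names `vector_bytes` and `raw_index_bytes` are illustrative, not part of any library.

```python
# Bytes needed to store one dimension at each precision level.
BYTES_PER_DIM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "binary": 0.125}


def vector_bytes(dim: int, precision: str) -> int:
    """Raw storage for a single vector, excluding any index overhead."""
    return int(dim * BYTES_PER_DIM[precision])


def raw_index_bytes(n_vectors: int, dim: int, precision: str) -> int:
    """Raw storage for n_vectors vectors, excluding any index overhead."""
    return n_vectors * vector_bytes(dim, precision)


for precision in BYTES_PER_DIM:
    per_vector = vector_bytes(768, precision)
    total_gb = raw_index_bytes(1_000_000, 768, precision) / 1e9
    print(f"{precision:>8}: {per_vector:>5} B/vector, {total_gb:.3f} GB per 1M vectors")
```

These figures cover raw vector data only, which is why they are an upper bound on the savings you will observe.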

### Why Real-World Reduction Is Lower

The raw vector data is not the only thing in memory. An HNSW index also stores the graph structure: neighbor lists, layer assignments, entry-point metadata. That graph overhead stays the same size regardless of which quantization you apply to the vectors themselves.

In Feather DB's benchmarks on a 500K-vector index, switching from float32 to int8 produces approximately a **1.7× reduction** in total resident memory. Still significant — but worth setting expectations correctly before you size your infrastructure.

The overhead fraction shrinks as your index grows. At very large index sizes (tens of millions of vectors), real-world reduction moves closer to the theoretical 4×.
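To see why a fixed overhead pulls the ratio below 4×, here is a back-of-the-envelope model. It assumes total memory is simply raw vector bytes plus an overhead term that quantization does not shrink. The overhead values in the loop are hypothetical placeholders chosen for illustration, not Feather DB measurements.

```python
def effective_reduction(raw_f32_bytes: float, raw_quant_bytes: float,
                        overhead_bytes: float) -> float:
    """Ratio of total memory before and after quantization.

    overhead_bytes is everything quantization leaves untouched
    (graph links, metadata, and so on).
    """
    return (raw_f32_bytes + overhead_bytes) / (raw_quant_bytes + overhead_bytes)


n_vectors, dim = 500_000, 768
raw_f32 = n_vectors * dim * 4   # float32 vector data
raw_int8 = n_vectors * dim * 1  # int8 vector data

# Hypothetical overhead sizes in MB, for illustration only.
for overhead_mb in (0, 250, 500, 1000):
    ratio = effective_reduction(raw_f32, raw_int8, overhead_mb * 1e6)
    print(f"overhead {overhead_mb:>5} MB -> {ratio:.2f}x total reduction")
```

With zero overhead the model returns the theoretical 4×; any non-zero overhead gives a smaller ratio, and the larger the overhead relative to the vector data, the closer the ratio gets to 1×.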

---

## Quality Impact

Scalar quantization maps each float32 value to the nearest representable int8 value using a per-dimension or per-tensor scale factor. The mapping is lossy: you lose the fine-grained differences between values that fall in the same quantization bin.
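As a minimal sketch of that mapping, the NumPy code below implements symmetric per-tensor int8 quantization: one scale factor for the whole vector, derived from its largest absolute value. This is one common scheme and is shown for illustration; it is an assumption that it resembles what any particular database does internally, and the function names are hypothetical.

```python
import numpy as np


def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: float32 -> int8 plus one scale."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original float32 values."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
x = rng.standard_normal(768).astype(np.float32)

q, scale = quantize_int8(x)
x_hat = dequantize_int8(q, scale)

# Each value lands in the nearest bin, so the error is at most half a bin.
max_err = float(np.max(np.abs(x - x_hat)))
print(f"scale={scale:.5f}, max abs error={max_err:.5f}, half bin={scale / 2:.5f}")
```

The half-bin error bound is the "lossy" part: two float32 values that round to the same integer become indistinguishable after quantization.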

In practice, the impact on retrieval quality is small:

- int8 typically loses **less than 1% recall@10** compared to float32 on standard ANN benchmarks.

- float16 loses even less — often less than 0.1%.

- Binary quantization loses considerably more, typically 5–15% depending on the dataset.

The reason int8 holds up well is that ANN search is fundamentally a ranking problem, not a distance measurement problem. You need the nearest neighbors to be ranked correctly relative to each other, not the exact cosine distances to be numerically precise. Quantization noise tends to be uniformly distributed across dimensions, which preserves relative rankings even as it distorts absolute values.

In Feather DB's internal benchmarks, an index that reaches 97.2% recall@10 at float32 holds at approximately 96.4% with int8: a 0.8 percentage point delta that is invisible in most production applications.
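If you want to measure this on your own data, recall@k can be computed by comparing exact float32 neighbors against neighbors found on quantized vectors. The sketch below uses brute-force search on synthetic random vectors rather than HNSW or Feather DB, so its result says nothing about the benchmark figures above. The data and helper names are assumptions for illustration.

```python
import numpy as np


def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest-scoring base vectors for each query row."""
    return np.argsort(-scores, axis=1)[:, :k]


def recall_at_k(exact_ids: np.ndarray, approx_ids: np.ndarray) -> float:
    """Fraction of true top-k neighbors that the approximate search recovered."""
    k = exact_ids.shape[1]
    hits = sum(len(set(e) & set(a)) for e, a in zip(exact_ids, approx_ids))
    return hits / (len(exact_ids) * k)


def quantize_int8(x: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor int8 quantization (scale is dropped: only ranks matter)."""
    scale = float(np.max(np.abs(x))) / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)


rng = np.random.default_rng(0)
base = rng.standard_normal((2000, 64)).astype(np.float32)
base /= np.linalg.norm(base, axis=1, keepdims=True)
queries = rng.standard_normal((50, 64)).astype(np.float32)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

# Ground truth: exact inner product on unit vectors (equivalent to cosine).
exact = top_k(queries @ base.T, 10)

# Same search on int8 codes, accumulating in int32 to avoid overflow.
approx_scores = quantize_int8(queries).astype(np.int32) @ quantize_int8(base).astype(np.int32).T
approx = top_k(approx_scores, 10)

recall = recall_at_k(exact, approx)
print(f"recall@10 on synthetic data: {recall:.3f}")
```

Swap in a sample of your real embeddings and queries to get a number that reflects your workload.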

---

## Feather DB int8: How It Works

### v0.15.0: In-RAM int8 Modality

Feather DB v0.15.0 introduced a dedicated int8 modality. When you add vectors with `modality="int8"`, Feather DB:

- Accepts float32 vectors from your application (no changes to your embedding pipeline).

- Quantizes them to int8 in-process before inserting into the HNSW index.

- Stores int8 vectors in memory for the lifetime of the session.

- Persists **exact float32 values** to disk in the `.feather` binary format, ensuring lossless round-trip.

The exact persistence design means you get the memory benefit in RAM without giving up numerical fidelity on disk. On reload, Feather DB re-quantizes from the stored float32 values, keeping the quantization consistent across restarts.

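```python
import numpy as np
import feather_db

db = feather_db.DB.open("my_index.feather", dim=768)

# Placeholder float32 vectors; in practice these come from your embedding model
my_float32_embedding = np.random.rand(768).astype(np.float32)
another_embedding = np.random.rand(768).astype(np.float32)
query_vec = np.random.rand(768).astype(np.float32)

# add() with modality="int8": vectors quantized in RAM, float32 on disk
db.add(id=1, vec=my_float32_embedding, modality="int8")
db.add(id=2, vec=another_embedding, modality="int8")

# Search works identically, with no changes to the query path
results = db.search(query_vec, k=10)
```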

No embedding model changes. No query path changes. One parameter.
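The persistence design described above (exact float32 on disk, int8 only in RAM, re-quantize on reload) can be illustrated with a toy NumPy example. This is not Feather DB's implementation or its `.feather` format. The `.npy` file, the fixed scale, and the function names are all assumptions made to keep the sketch self-contained.

```python
import os
import tempfile

import numpy as np

# Assumption: embeddings are unit-normalized, so every value lies in [-1, 1].
SCALE = 1.0 / 127.0


def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    """Deterministic float32 -> int8 mapping for a fixed scale."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)


def save_exact(path: str, vectors_f32: np.ndarray) -> None:
    """Persist the original float32 values, not the int8 codes."""
    np.save(path, vectors_f32.astype(np.float32))


def load_and_quantize(path: str, scale: float) -> tuple[np.ndarray, np.ndarray]:
    """Reload float32 from disk and rebuild the in-RAM int8 codes."""
    vectors_f32 = np.load(path)
    return vectors_f32, quantize_int8(vectors_f32, scale)


rng = np.random.default_rng(0)
original = rng.standard_normal((100, 768)).astype(np.float32)
original /= np.linalg.norm(original, axis=1, keepdims=True)

codes_before = quantize_int8(original, SCALE)  # what lives in RAM

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.npy")
    save_exact(path, original)
    restored, codes_after = load_and_quantize(path, SCALE)

print("float32 round-trip exact:", np.array_equal(original, restored))
print("int8 codes identical after reload:", np.array_equal(codes_before, codes_after))
```

Because quantization is a pure function of the stored float32 values and the scale, re-quantizing on load reproduces the same int8 codes, while the file on disk keeps full precision.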

### v0.16.0: int8 HNSW Graph Persistence

v0.15.0 had one remaining cost: on index load, Feather DB re-inserted vectors from the stored float32 values into a fresh HNSW graph. For large indexes, this rebuild added several seconds of startup latency.

v0.16.0 eliminates that cost. int8 modalities now persist their own int8 HNSW graph to disk alongside the float32 vectors. On `open()`, the graph loads directly from the persisted int8 representation — startup time is equivalent to loading a float32 index of the same node count, not rebuilding it.

This makes int8 a first-class production modality: minimum RAM during operation, fast startup, lossless underlying data.

---

## When Not to Use Quantization

Quantization is not always the right call. Three situations where you should stay at float32:

**High-precision applications.** Medical record retrieval, legal document similarity, financial compliance matching — workloads where a false negative carries real consequences. Even a sub-1% recall drop may not be acceptable.

**Small indexes.** If your index holds fewer than ~50,000 vectors, total memory consumption is already small. The HNSW graph overhead is proportionally larger, so your real-world reduction will be well below 1.7×. The complexity of managing a quantized modality is not worth fractional gains.

**Experimental or iterative workflows.** If you are still running ablations on embedding models or index parameters, staying at float32 removes one variable from your debugging surface. Switch to int8 once your index design is stable.

---

## Combining with Adaptive Capacity

Feather DB's adaptive capacity feature lets you start an index with a small initial allocation and grow it dynamically as vectors are added. int8 and adaptive capacity compose directly, and together they represent the minimum possible RAM footprint for a Feather DB deployment:

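```python
import numpy as np
import feather_db

db = feather_db.DB.open(
    "compact.feather",
    dim=768,
    initial_capacity=1_000,   # start small
    growth_factor=2.0          # double when needed
)

# Placeholder float32 vector; in practice this comes from your embedding model
embedding = np.random.rand(768).astype(np.float32)

# All adds use int8, so in-RAM quantization is active from the first insert
db.add(id=1, vec=embedding, modality="int8")
```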

With a 1,000-vector initial capacity and int8, your index starts at roughly 750 KB of vector memory before graph overhead. At float32 with a 100,000-vector pre-allocation, you would pre-commit ~300 MB before ingesting a single vector. For agents that start small and grow — which is most agents — this matters.
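The figures in the paragraph above follow from capacity × dimensions × bytes per dimension. The sketch below reproduces them and also counts how many growth steps separate the two capacities. It assumes that capacity is multiplied by `growth_factor` each time the index fills up; that growth rule and the helper names are assumptions for illustration.

```python
import math


def preallocated_bytes(capacity: int, dim: int, bytes_per_dim: int) -> int:
    """Vector memory reserved up front, before any graph overhead."""
    return capacity * dim * bytes_per_dim


def growth_steps(initial_capacity: int, growth_factor: float, target: int) -> int:
    """Number of resizes needed before capacity reaches the target."""
    steps, capacity = 0, initial_capacity
    while capacity < target:
        capacity = math.ceil(capacity * growth_factor)
        steps += 1
    return steps


small_int8 = preallocated_bytes(1_000, 768, 1)    # int8, small start
large_f32 = preallocated_bytes(100_000, 768, 4)   # float32, pre-allocated

print(f"int8, 1,000 capacity:      {small_int8 / 1024:.0f} KiB")
print(f"float32, 100,000 capacity: {large_f32 / 1e6:.1f} MB")
print(f"doublings from 1,000 to >= 100,000: {growth_steps(1_000, 2.0, 100_000)}")
```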

---

## Quantization Methods Compared

Feather DB uses scalar quantization. It is not the only approach. Here is how the three main methods compare:

| Method | How it works | Memory reduction | Recall loss | Used in |
| --- | --- | --- | --- | --- |
| **Scalar quantization** | Maps each float32 value to int8 using a linear scale per dimension or per tensor. | ~1.7–4× (real-world) | <1% | Feather DB, Weaviate, Qdrant |
| **Product quantization (PQ)** | Splits the vector into sub-vectors, encodes each sub-vector as an index into a learned codebook. | 8–64× | 2–8% | Faiss (IVF-PQ), Pinecone |
| **Binary quantization** | Maps each float32 to a single bit (positive → 1, negative → 0). Hamming distance replaces cosine. | 32× | 5–15% | Vespa, Qdrant (experimental) |

**Scalar quantization vs product quantization:** PQ achieves higher compression ratios but requires a training step on a representative sample of your data to build the codebooks. That training needs to run again if your data distribution shifts significantly. Scalar quantization has no training step — it is applied directly to each vector using statistics from the current batch or index. This makes it operationally simpler and more predictable for indexes that grow incrementally.
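To make the training step concrete, here is a minimal product quantization sketch in NumPy. It learns one small codebook per sub-vector with a few iterations of plain k-means, then encodes each vector as one byte per sub-vector. It is a teaching sketch under simplified assumptions (random data, tiny codebooks, hypothetical function names), not how Faiss or any production system implements PQ.

```python
import numpy as np


def kmeans(data: np.ndarray, k: int, iters: int, rng: np.random.Generator) -> np.ndarray:
    """Plain Lloyd's k-means; returns k centroids."""
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for c in range(k):
            members = data[assign == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return centroids


def train_pq(train: np.ndarray, n_sub: int, k: int, iters: int = 10, seed: int = 0):
    """The training step: one codebook per sub-vector slice."""
    rng = np.random.default_rng(seed)
    sub_dim = train.shape[1] // n_sub
    return [kmeans(train[:, i * sub_dim:(i + 1) * sub_dim], k, iters, rng)
            for i in range(n_sub)]


def encode_pq(x: np.ndarray, codebooks) -> np.ndarray:
    """Replace each sub-vector with the index of its nearest centroid."""
    sub_dim = codebooks[0].shape[1]
    codes = np.empty((len(x), len(codebooks)), dtype=np.uint8)
    for i, cb in enumerate(codebooks):
        sub = x[:, i * sub_dim:(i + 1) * sub_dim]
        dists = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        codes[:, i] = dists.argmin(axis=1)
    return codes


def decode_pq(codes: np.ndarray, codebooks) -> np.ndarray:
    """Approximate reconstruction by concatenating the chosen centroids."""
    return np.hstack([cb[codes[:, i]] for i, cb in enumerate(codebooks)])


rng = np.random.default_rng(0)
train = rng.standard_normal((2000, 64)).astype(np.float32)

codebooks = train_pq(train, n_sub=8, k=16)
codes = encode_pq(train[:100], codebooks)
approx = decode_pq(codes, codebooks)

print("bytes per vector: float32 =", 64 * 4, ", PQ codes =", codes.shape[1])
print("mean squared reconstruction error:", float(np.mean((train[:100] - approx) ** 2)))
```

The codebooks are fitted to `train`, which is the operational cost the paragraph above describes: if later data looks different from the training sample, the centroids fit it poorly until you retrain.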

**Scalar quantization vs binary:** Binary quantization offers the highest compression, but the recall loss is substantially higher and more dataset-dependent. Binary quantization tends to perform better on datasets where vectors have high angular separation (distinct semantic clusters) and poorly on densely packed, fine-grained retrieval tasks. Scalar int8 is more consistent across dataset types.
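Binary quantization is short enough to sketch in full. The NumPy code below keeps only the sign of each dimension, packs 8 bits per byte, and ranks by Hamming distance using XOR. It is an illustrative sketch on random data with hypothetical helper names, and it counts bits with `np.unpackbits` rather than a hardware popcount.

```python
import numpy as np


def binarize(x: np.ndarray) -> np.ndarray:
    """One bit per dimension (positive -> 1, otherwise 0), packed 8 per byte."""
    return np.packbits(x > 0, axis=1)


def hamming(query_code: np.ndarray, codes: np.ndarray) -> np.ndarray:
    """Number of differing bits between one query code and every stored code."""
    xor = np.bitwise_xor(codes, query_code)
    return np.unpackbits(xor, axis=1).sum(axis=1)


rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 768)).astype(np.float32)

codes = binarize(vectors)            # 768 dims -> 96 bytes per vector
query_code = binarize(vectors[:1])   # reuse the first vector as the query
dists = hamming(query_code, codes)

print("bytes per vector:", codes.shape[1])
print("nearest by Hamming distance:", int(dists.argmin()), "at distance", int(dists.min()))
```

Everything about a dimension except its sign is discarded, which is why recall depends so heavily on how well separated the vectors are.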

For most agent memory and RAG workloads — where you are storing conversational context, documents, or creative assets and need reliable recall — scalar int8 is the right tradeoff. PQ is worth exploring when you are at scales above 10M vectors and RAM is severely constrained. Binary quantization is a last resort or a hardware-specific optimization (popcount instructions are very fast on modern CPUs).

---

## The Short Version

- float32 → int8 gives you a 4× raw reduction and ~1.7× real-world reduction in Feather DB (HNSW graph overhead is the gap).

- Recall loss is typically under 1% — acceptable for the vast majority of production workloads.

- Enable with `modality="int8"` in `add()`. No changes to your embedding pipeline or query code.

- v0.15.0 added in-RAM int8 with lossless float32 persistence. v0.16.0 added persisted int8 HNSW graphs for fast startup.

- Pair with adaptive capacity for minimum initial RAM footprint.

- Skip quantization for small indexes, high-precision workloads, or while iterating on your index design.

- Scalar quantization (Feather DB) requires no training step. Product quantization (Faiss) achieves higher compression but needs codebook training. Binary quantization has the highest compression and the highest recall cost.

---

*Feather DB v0.15.0 and v0.16.0 — [github.com/feather-store/feather](https://github.com/feather-store/feather)*

---

*This is the machine-readable mirror of the theory post at [getfeather.store/theory/vector-database-quantization-guide](https://getfeather.store/theory/vector-database-quantization-guide). For the full Feather DB documentation, see [getfeather.store/llms-full.txt](https://getfeather.store/llms-full.txt).*