All about vector quantization
Benchmarking scalar and product quantization methods
txtai supports a number of approximate nearest neighbor (ANN) libraries for vector storage, including Faiss, Hnswlib, Annoy, NumPy and PyTorch. Custom implementations can also be added.
The default ANN for txtai is Faiss. Faiss has by far the largest array of configurable options for building an ANN index. This article covers quantization, the different approaches available and the tradeoffs between them.
Install dependencies
Install txtai and all dependencies.
# Install txtai
pip install txtai pytrec_eval rank-bm25 elasticsearch psutil
Preparing the datasets
First, let's download a subset of the datasets from the BEIR evaluation framework. We'll also retrieve the standard txtai benchmark script. These will be used to help judge the accuracy of quantization methods.
import os
# Get benchmarks script
os.system("wget https://raw.githubusercontent.com/neuml/txtai/master/examples/benchmarks.py")
# Create output directory
os.makedirs("beir", exist_ok=True)
if os.path.exists("benchmarks.json"):
    os.remove("benchmarks.json")
# Download subset of BEIR datasets
datasets = ["nfcorpus", "arguana", "scifact"]
for dataset in datasets:
    url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
    os.system(f"wget {url}")
    os.system(f"mv {dataset}.zip beir")
    os.system(f"unzip -d beir beir/{dataset}.zip")
Evaluation
Next, we'll set up the scaffolding to run evaluations.
import pandas as pd
import yaml
def writeconfig(dataset, quantize):
    sources = {"arguana": "IVF11", "nfcorpus": "IDMap", "scifact": "IVF6"}
    config = {
        "embeddings": {
            "batch": 8192,
            "encodebatch": 128,
            "faiss": {
                "sample": 0.05
            }
        }
    }
    if quantize and quantize[-1].isdigit() and int(quantize[-1]) < 4:
        # Use vector quantization for 1, 2 and 3 bit quantization
        config["embeddings"]["quantize"] = int(quantize[-1])
    elif quantize:
        # Use Faiss quantization for other forms of quantization
        config["embeddings"]["faiss"]["components"] = f"{sources[dataset]},{quantize}"
    # Derive name
    name = quantize if quantize else "baseline"
    # Derive config path and write output
    path = f"{dataset}_{name}.yml"
    with open(path, "w") as f:
        yaml.dump(config, f)
    return name, path
def benchmarks():
    # Read JSON lines data straight from the file (recent pandas versions deprecate passing a raw JSON string)
    df = pd.read_json("benchmarks.json", lines=True).sort_values(by=["source", "ndcg_cut_10"], ascending=[True, False])
    return df[["source", "name", "ndcg_cut_10", "map_cut_10", "recall_10", "P_10", "disk"]].reset_index(drop=True)
# Runs benchmark evaluation
def evaluate(quantize=None):
    for dataset in datasets:
        # Build config based on requested quantization
        name, config = writeconfig(dataset, quantize)
        command = f"python benchmarks.py -d beir -s {dataset} -m embeddings -c \"{config}\" -n \"{name}\""
        os.system(command)
Establish a baseline
Before introducing vector quantization, let's establish a baseline of accuracy per source without quantization. The following table shows accuracy metrics along with the disk storage size in KB.
evaluate()
benchmarks()
source | name | ndcg_cut_10 | map_cut_10 | recall_10 | P_10 | disk |
arguana | baseline | 0.47886 | 0.38931 | 0.76600 | 0.07660 | 13416 |
nfcorpus | baseline | 0.30893 | 0.10789 | 0.15315 | 0.23622 | 5517 |
scifact | baseline | 0.65273 | 0.60386 | 0.78972 | 0.08867 | 7878 |
Quantization
The two main types of vector quantization are scalar quantization and product quantization.
Scalar quantization maps floating point data to a series of integers. For example, 8-bit quantization splits the range of floats into 256 buckets. This cuts data storage by a factor of 4 when working with 32-bit floats, since each dimension now only stores 1 byte vs 4. A more dramatic version of this is binary or 1-bit quantization, where the floating point range is split into just two halves and each value is stored as 0 or 1. The trade-off, as one would expect, is accuracy.
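To make the idea concrete, here is a minimal pure-Python sketch of 8-bit scalar quantization. It assumes a simple min/max range taken from the vector itself; the function names are illustrative only, and Faiss and txtai may pick ranges and round differently.

```python
import random

def quantize(values, bits=8):
    """Map floats to integer codes in [0, 2**bits - 1] using the min/max of the data."""
    levels = 2 ** bits
    low, high = min(values), max(values)
    # Step between adjacent levels; fall back to 1.0 for a constant vector
    scale = (high - low) / (levels - 1) or 1.0
    codes = [round((v - low) / scale) for v in values]
    return codes, low, scale

def dequantize(codes, low, scale):
    """Approximate the original floats from the integer codes."""
    return [low + c * scale for c in codes]

random.seed(0)
vector = [random.uniform(-1, 1) for _ in range(384)]

codes, low, scale = quantize(vector, bits=8)
restored = dequantize(codes, low, scale)

# The round-trip error is bounded by half a step
error = max(abs(a - b) for a, b in zip(vector, restored))
print(min(codes), max(codes), error <= scale / 2 + 1e-9)
```

Each code fits in a single byte, which is where the 4x saving over 32-bit floats comes from. Lowering `bits` shrinks storage further and widens the step, so the round-trip error grows.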
Product quantization is similar in that the process bins a floating point range into codes, but it's more complex. This method splits vectors across dimensions into subvectors and runs those subvectors through a clustering algorithm. As with scalar quantization, this can lead to a substantial reduction in data storage at the expense of accuracy. The Faiss documentation links to a number of great papers with more information on this method.
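The split-then-cluster process can be sketched in a few lines of plain Python. This is a toy illustration, not how Faiss implements it: it assumes a basic Lloyd's k-means, tiny random data and hypothetical helper names (`kmeans`, `nearest`, `train`, `encode`).

```python
import random

def nearest(point, centroids):
    """Index of the centroid with the smallest squared distance to point."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))

def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster (unchanged if the cluster is empty)
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = [sum(col) / len(cluster) for col in zip(*cluster)]
    return centroids

def train(vectors, m, bits):
    """Learn one codebook of 2**bits centroids for each of the m subspaces."""
    size = len(vectors[0]) // m
    return [kmeans([v[i * size:(i + 1) * size] for v in vectors], 2 ** bits) for i in range(m)]

def encode(vector, codebooks):
    """Replace each subvector with the index of its nearest centroid."""
    size = len(vector) // len(codebooks)
    return [nearest(vector[i * size:(i + 1) * size], book) for i, book in enumerate(codebooks)]

random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(16)] for _ in range(200)]

# 4 subvectors with 4-bit codes: 16 floats (64 bytes) become 4 codes (2 bytes)
codebooks = train(vectors, m=4, bits=4)
codes = encode(vectors[0], codebooks)
print(codes)
```

Only the centroid indexes are stored per vector; the codebooks are shared across the whole index. That is why storage depends on the number of codes and bits per code rather than on the original dimension count.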
Quantization is available at the vector processing and datastore levels in txtai. In both cases, it requires an ANN backend that can support integer vectors. Currently, only Faiss, NumPy and PyTorch are supported.
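As a rough sketch, the two levels correspond to the two branches of the `writeconfig` helper defined earlier. The dictionaries below reuse only keys and values that appear in that helper. Mapping "vector processing level" to the `quantize` key and "datastore level" to the Faiss `components` string is an interpretation of that helper's comments, not something confirmed against the txtai documentation.

```python
import json

# Vector level: txtai quantizes the vectors itself (the "quantize" key, used above for 1 to 3 bits)
vector_level = {"embeddings": {"quantize": 1}}

# Datastore level: Faiss applies the quantization named in its components string
datastore_level = {"embeddings": {"faiss": {"components": "IDMap,SQ8"}}}

for name, config in [("vector", vector_level), ("datastore", datastore_level)]:
    print(name, json.dumps(config))
```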
Let's benchmark a variety of quantization methods.
# Evaluate quantization methods
for quantize in ["SQ1", "SQ4", "SQ8", "PQ48x4fs", "PQ96x4fs", "PQ192x4fs"]:
evaluate(quantize)
# Show benchmarks
benchmarks()
source | name | ndcg_cut_10 | map_cut_10 | recall_10 | P_10 | disk |
arguana | baseline | 0.47886 | 0.38931 | 0.76600 | 0.07660 | 13416 |
arguana | SQ8 | 0.47781 | 0.38781 | 0.76671 | 0.07667 | 3660 |
arguana | SQ4 | 0.47771 | 0.38915 | 0.76174 | 0.07617 | 2034 |
arguana | PQ192x4fs | 0.46322 | 0.37341 | 0.75391 | 0.07539 | 1260 |
arguana | PQ96x4fs | 0.43744 | 0.35052 | 0.71906 | 0.07191 | 844 |
arguana | SQ1 | 0.42604 | 0.33997 | 0.70555 | 0.07055 | 795 |
arguana | PQ48x4fs | 0.40220 | 0.31653 | 0.67852 | 0.06785 | 637 |
nfcorpus | SQ4 | 0.31028 | 0.10758 | 0.15417 | 0.23839 | 751 |
nfcorpus | SQ8 | 0.30917 | 0.10810 | 0.15327 | 0.23591 | 1433 |
nfcorpus | baseline | 0.30893 | 0.10789 | 0.15315 | 0.23622 | 5517 |
nfcorpus | PQ192x4fs | 0.30722 | 0.10678 | 0.15168 | 0.23467 | 433 |
nfcorpus | PQ96x4fs | 0.29594 | 0.09929 | 0.13996 | 0.22693 | 262 |
nfcorpus | SQ1 | 0.26582 | 0.08579 | 0.12658 | 0.19907 | 237 |
nfcorpus | PQ48x4fs | 0.25874 | 0.08100 | 0.11912 | 0.19567 | 177 |
scifact | SQ4 | 0.65299 | 0.60328 | 0.79139 | 0.08867 | 1078 |
scifact | baseline | 0.65273 | 0.60386 | 0.78972 | 0.08867 | 7878 |
scifact | SQ8 | 0.65149 | 0.60193 | 0.78972 | 0.08867 | 2050 |
scifact | PQ192x4fs | 0.64046 | 0.58823 | 0.78933 | 0.08867 | 622 |
scifact | PQ96x4fs | 0.62256 | 0.57773 | 0.74861 | 0.08400 | 375 |
scifact | SQ1 | 0.58724 | 0.53418 | 0.73989 | 0.08267 | 338 |
scifact | PQ48x4fs | 0.52292 | 0.46611 | 0.68744 | 0.07700 | 251 |
Review
Each of the sources above was run through a series of scalar and product quantization settings. The accuracy vs disk space tradeoff is clear to see.
A couple of key points to highlight.
The vector model outputs vectors with 384 dimensions
Scalar quantization (SQ) was evaluated for 1-bit (binary), 4 and 8 bits
1-bit (binary) quantization stores vectors in binary indexes
For product quantization (PQ), three settings were tested: 48, 96 and 192 codes per vector, all using 4-bit codes
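The settings above translate into a per-vector storage cost with some simple arithmetic. The sketch below assumes a 32-bit float baseline and counts only the encoded vector itself; the `disk` column in the table also includes ids and index structures, so on-disk ratios will not match these numbers exactly.

```python
DIMENSIONS = 384

def scalar_bytes(bits, dimensions=DIMENSIONS):
    """Bytes per vector when each dimension is stored with the given number of bits."""
    return dimensions * bits / 8

def product_bytes(codes, bits=4):
    """Bytes per vector when it is stored as a fixed number of codes."""
    return codes * bits / 8

sizes = {
    "baseline": scalar_bytes(32),
    "SQ8": scalar_bytes(8),
    "SQ4": scalar_bytes(4),
    "SQ1": scalar_bytes(1),
    "PQ192x4fs": product_bytes(192),
    "PQ96x4fs": product_bytes(96),
    "PQ48x4fs": product_bytes(48),
}

for name, size in sizes.items():
    print(f"{name}: {size:.0f} bytes per vector")
```

Note that SQ1 and PQ96x4fs both come out at 48 bytes per vector, which is consistent with their similar disk sizes in the table.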
In general, the larger the index size, the better the scores. There are a few exceptions to this but the differences are minimal in those cases. The smallest scalar and product quantization indexes (SQ1 and PQ48x4fs) are roughly 17 to 31 times smaller than the baseline.
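The size reduction can be checked directly against the table. This short snippet copies the `disk` values (in KB) for the baseline and the two smallest indexes per source, then computes how many times smaller each index is.

```python
# Disk sizes in KB, copied from the benchmark table above
disk = {
    "arguana": {"baseline": 13416, "SQ1": 795, "PQ48x4fs": 637},
    "nfcorpus": {"baseline": 5517, "SQ1": 237, "PQ48x4fs": 177},
    "scifact": {"baseline": 7878, "SQ1": 338, "PQ48x4fs": 251},
}

# Ratio of baseline size to quantized size for each source and method
ratios = {
    source: {name: sizes["baseline"] / size for name, size in sizes.items() if name != "baseline"}
    for source, sizes in disk.items()
}

for source, values in ratios.items():
    for name, ratio in values.items():
        print(f"{source} {name}: {ratio:.1f}x smaller")
```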
It's important to note that the smaller scalar methods typically need more dimensions to perform competitively. With that being said, even at 384 dimensions, binary quantization still does OK. txtai supports scalar quantization precisions from 1 through 8 bits.
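For intuition on why binary quantization can still rank results reasonably, the sketch below packs a 384-dimension vector into 48 bytes and compares vectors by Hamming distance. Thresholding at zero is an assumption made for illustration, and `binarize` and `hamming` are hypothetical helpers; the source does not specify how txtai or Faiss choose the threshold.

```python
import random

def binarize(vector):
    """Keep only the sign of each dimension and pack the bits into bytes."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits.to_bytes((len(vector) + 7) // 8, "big")

def hamming(a, b):
    """Number of bit positions where two packed vectors differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

random.seed(0)
query = [random.gauss(0, 1) for _ in range(384)]
# A slightly perturbed copy of the query and an unrelated random vector
similar = [v + random.gauss(0, 0.1) for v in query]
unrelated = [random.gauss(0, 1) for _ in range(384)]

q, s, u = binarize(query), binarize(similar), binarize(unrelated)
print(len(q), hamming(q, s), hamming(q, u))
```

With more dimensions there are more sign bits to compare, so the gap between near and far vectors becomes easier to separate from noise.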
This is just a subset of the quantization methods available in Faiss. More details can be found in the Faiss documentation.
Wrapping up
This article evaluated a variety of vector quantization methods. Quantization is an option to reduce storage costs at the expense of accuracy. Larger vector models (1024+ dimensions) tend to retain accuracy better with more aggressive quantization methods. As always, results will vary depending on your data.