All about vector quantization

All about vector quantization

Benchmarking scalar and product quantization methods

Featured on Hashnode

txtai supports a number of approximate nearest neighbor (ANN) libraries for vector storage. This includes Faiss, Hnswlib, Annoy, NumPy and PyTorch. Custom implementations can also be added.

The default ANN for txtai is Faiss. Faiss has by far the largest array of configurable options in building an ANN index. This article will cover quantization and different approaches that are possible along with the tradeoffs.

Install dependencies

Install txtai and all dependencies.

# Install txtai
pip install txtai pytrec_eval rank-bm25 elasticsearch psutil

Preparing the datasets

First, let's download a subset of the datasets from the BEIR evaluation framework. We'll also retrieve the standard txtai benchmark script. These will be used to help judge the accuracy of quantization methods.

import os

# Get benchmarks script
os.system("wget https://raw.githubusercontent.com/neuml/txtai/master/examples/benchmarks.py")

# Create output directory
os.makedirs("beir", exist_ok=True)

if os.path.exists("benchmarks.json"):
  os.remove("benchmarks.json")

# Download subset of BEIR datasets
datasets = ["nfcorpus", "arguana", "scifact"]
for dataset in datasets:
  url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
  os.system(f"wget {url}")
  os.system(f"mv {dataset}.zip beir")
  os.system(f"unzip -d beir beir/{dataset}.zip")

Evaluation

Next, we'll setup the scaffolding to run evaluations.

import pandas as pd
import yaml

def writeconfig(dataset, quantize):
  sources = {"arguana": "IVF11", "nfcorpus": "IDMap", "scifact": "IVF6"}
  config = {
    "embeddings": {
      "batch": 8192,
      "encodebatch": 128,
      "faiss": {
          "sample": 0.05
      }
    }
  }

  if quantize and quantize[-1].isdigit() and int(quantize[-1]) < 4:
    # Use vector quantization for 1, 2 and 3 bit quantization
    config["embeddings"]["quantize"] = int(quantize[-1])
  elif quantize:
    # Use Faiss quantization for other forms of quantization
    config["embeddings"]["faiss"]["components"] = f"{sources[dataset]},{quantize}"

  # Derive name
  name = quantize if quantize else "baseline"

  # Derive config path and write output
  path = f"{dataset}_{name}.yml"
  with open(path, "w") as f:
    yaml.dump(config, f)

  return name, path

def benchmarks():
  # Read JSON lines data
  with open("benchmarks.json") as f:
    data = f.read()

  df = pd.read_json(data, lines=True).sort_values(by=["source", "ndcg_cut_10"], ascending=[True, False])
  return df[["source", "name", "ndcg_cut_10", "map_cut_10", "recall_10", "P_10", "disk"]].reset_index(drop=True)

# Runs benchmark evaluation
def evaluate(quantize=None):
  for dataset in datasets:
    # Build config based on requested quantization
    name, config = writeconfig(dataset, quantize)

    command = f"python benchmarks.py -d beir -s {dataset} -m embeddings -c \"{config}\" -n \"{name}\""
    os.system(command)

Establish a baseline

Before introducing vector quantization, let's establish a baseline of accuracy per source without quantization. The following table shows accuracy metrics along with the disk storage size in KB.

evaluate()
benchmarks()
sourcenamendcg_cut_10map_cut_10recall_10P_10disk
arguanabaseline0.478860.389310.766000.0766013416
nfcorpusbaseline0.308930.107890.153150.236225517
scifactbaseline0.652730.603860.789720.088677878

Quantization

The two main types of vector quantization are scalar quantization and product quantization.

Scalar quantization maps floating point data to a series of integers. For example, 8-bit quantization splits the range of floats into 255 buckets. This cuts data storage down by 4 when working with 32-bit floats, since each dimension now only stores 1 byte vs 4. A more dramatic version of this is binary or 1-bit quantization, where the floating point range is cut in half, 0 or 1. The trade-off as one would expect is accuracy.

Product quantization is similar in that the process bins a floating point range into codes but it's more complex. This method splits vectors across dimensions into subvectors and runs those subvectors through a clustering algorithm. This can lead to a substantial reduction in data storage at the expense of accuracy like with scalar quantization. The Faiss documentation has a number of great papers with more information on this method.

Quantization is available at the vector processing and datastore levels in txtai. In both cases, it requires an ANN backend that can support integer vectors. Currently, only Faiss, NumPy and Torch are supported.

Let's benchmark a variety of quantization methods.

# Evaluate quantization methods
for quantize in ["SQ1", "SQ4", "SQ8", "PQ48x4fs", "PQ96x4fs", "PQ192x4fs"]:
  evaluate(quantize)

# Show benchmarks
benchmarks()
sourcenamendcg_cut_10map_cut_10recall_10P_10disk
arguanabaseline0.478860.389310.766000.0766013416
arguanaSQ80.477810.387810.766710.076673660
arguanaSQ40.477710.389150.761740.076172034
arguanaPQ192x4fs0.463220.373410.753910.075391260
arguanaPQ96x4fs0.437440.350520.719060.07191844
arguanaSQ10.426040.339970.705550.07055795
arguanaPQ48x4fs0.402200.316530.678520.06785637
nfcorpusSQ40.310280.107580.154170.23839751
nfcorpusSQ80.309170.108100.153270.235911433
nfcorpusbaseline0.308930.107890.153150.236225517
nfcorpusPQ192x4fs0.307220.106780.151680.23467433
nfcorpusPQ96x4fs0.295940.099290.139960.22693262
nfcorpusSQ10.265820.085790.126580.19907237
nfcorpusPQ48x4fs0.258740.081000.119120.19567177
scifactSQ40.652990.603280.791390.088671078
scifactbaseline0.652730.603860.789720.088677878
scifactSQ80.651490.601930.789720.088672050
scifactPQ192x4fs0.640460.588230.789330.08867622
scifactPQ96x4fs0.622560.577730.748610.08400375
scifactSQ10.587240.534180.739890.08267338
scifactPQ48x4fs0.522920.466110.687440.07700251

Review

Each of the sources above were run through a series of scalar and product quantization settings. The accuracy vs disk space trade off is clear to see.

Couple key points to highlight.

  • The vector model outputs vectors with 384 dimensions

  • Scalar quantization (SQ) was evaluated for 1-bit (binary), 4 and 8 bits

  • 1-bit (binary) quantization stores vectors in binary indexes

  • For product quantization (PQ), three methods were tested. 48, 96 and 192 codes respectively, all using 4-bit codes

In general, the larger the index size, the better the scores. There are a few exceptions to this but the differences are minimal in those cases. The smaller scalar and product quantization indexes are up to 20 times smaller.

It's important to note that the smaller scalar methods typically need a wider number of dimensions to perform competitively. With that being said, even at 384 dimensions, binary quantization still does OK. txtai supports scalar quantization precisions from 1 through 8 bits.

This is just a subset of the available quantization methods available in Faiss. More details can be found in the Faiss documentation.

Wrapping up

This article evaluated a variety of vector quantization methods. Quantization is an option to reduce storage costs at the expense of accuracy. Larger vector models (1024+ dimensions) will retain accuracy better with more aggressive quantization methods. As always, results will vary depending on your data.