Anatomy of a txtai index

Anatomy of a txtai index

Deep dive into the file formats behind a txtai embeddings index

This article inspects the filesystem of a txtai embeddings index and gives an overview of the structure.

Install dependencies

Install txtai and all dependencies.

pip install txtai

Create index

Let's first create an index to inspect. We'll use the classic txtai example.

from txtai.embeddings import Embeddings

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", "content": True, "objects": True})

# Create an index for the list of text
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

# Run a search
embeddings.search("feel good story", 1)
[{'id': '4',
  'score': 0.08329004049301147,
  'text': 'Maine man wins $1M from $25 lottery ticket'}]

Print index info

Embeddings indexes have an info method which prints metadata about the index. This can be used to see when the index was build, what settings were used and when it was last updated.

# Print metadata
embeddings.info()
{
  "backend": "faiss",
  "build": {
    "create": "2022-03-02T15:18:41Z",
    "python": "3.7.12",
    "settings": {
      "components": "IDMap,Flat"
    },
    "system": "Linux (x86_64)",
    "txtai": "4.3.0"
  },
  "content": "sqlite",
  "dimensions": 768,
  "objects": true,
  "offset": 6,
  "path": "sentence-transformers/nli-mpnet-base-v2",
  "update": "2022-03-02T15:18:41Z"
}

Save index and review file structure

Next let's save the index and review the file structure. This section prints each file, and runs commands to show

# Save the index
embeddings.save("index")

# Show basic details about index files
for f in ["config", "documents", "embeddings"]:
  !ls -l "index/{f}"
  !xxd "index/{f}" | head -5
  !file "index/{f}"
  !echo
-rw-r--r-- 1 root root 295 Mar  2 15:18 index/config
00000000: 8004 951c 0100 0000 0000 007d 9428 8c04  ...........}.(..
00000010: 7061 7468 948c 2773 656e 7465 6e63 652d  path..'sentence-
00000020: 7472 616e 7366 6f72 6d65 7273 2f6e 6c69  transformers/nli
00000030: 2d6d 706e 6574 2d62 6173 652d 7632 948c  -mpnet-base-v2..
00000040: 0763 6f6e 7465 6e74 948c 0673 716c 6974  .content...sqlit
index/config: data

-rw-r--r-- 1 root root 28672 Mar  2 15:18 index/documents
00000000: 5351 4c69 7465 2066 6f72 6d61 7420 3300  SQLite format 3.
00000010: 1000 0101 0040 2020 0000 0001 0000 0007  .....@  ........
00000020: 0000 0000 0000 0000 0000 0001 0000 0004  ................
00000030: 0000 0000 0000 0000 0000 0001 0000 0000  ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
index/documents: SQLite 3.x database, last written using SQLite version 3022000

-rw-r--r-- 1 root root 18570 Mar  2 15:18 index/embeddings
00000000: 4978 4d70 0003 0000 0600 0000 0000 0000  IxMp............
00000010: 0000 1000 0000 0000 0000 1000 0000 0000  ................
00000020: 0100 0000 0049 7846 4900 0300 0006 0000  .....IxFI.......
00000030: 0000 0000 0000 0010 0000 0000 0000 0010  ................
00000040: 0000 0000 0001 0000 0000 0012 0000 0000  ................
index/embeddings: data

The directory has three files: config, documents and embeddings.

Config

Given that the configuration file is serialized with Python pickle, it can be loaded in Python.

import json
import pickle

with open("index/config", "rb") as config:
  print(json.dumps(pickle.load(config), sort_keys=True, indent=2))
{
  "backend": "faiss",
  "build": {
    "create": "2022-03-02T15:18:41Z",
    "python": "3.7.12",
    "settings": {
      "components": "IDMap,Flat"
    },
    "system": "Linux (x86_64)",
    "txtai": "4.3.0"
  },
  "content": "sqlite",
  "dimensions": 768,
  "objects": true,
  "offset": 6,
  "path": "sentence-transformers/nli-mpnet-base-v2",
  "update": "2022-03-02T15:18:41Z"
}

Notice how this is the same output as embeddings.info().

Documents

The documents file is a SQLite database with three tables, documents, objects and sections. Let's take a look inside.

import pandas as pd
import sqlite3

from IPython.display import display, Markdown

# Print details of a txtai SQLite document database
def showdb(path):
  db = sqlite3.connect(path)

  display(Markdown("## Tables"))
  df = pd.read_sql_query("select name FROM sqlite_master where type='table'", db)
  display(df.style.hide_index())

  for table in df["name"]:
    display(Markdown(f"## {table}"))
    df = pd.read_sql_query(f"select * from {table}", db)

    # Truncate large binary objects
    if "object" in df:
      df["object"] = df["object"].str.slice(0, 25)

    display(df.style.hide_index())

showdb("index/documents")

Tables

name
documents
objects
sections

documents

iddatatagsentry

objects

idobjecttagsentry

sections

indexididtexttagsentry
00US tops 5 million confirmed virus casesNone2022-03-02 15:18:40.591760
11Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized icebergNone2022-03-02 15:18:40.591760
22Beijing mobilises invasion craft along coast as Taiwan tensions escalateNone2022-03-02 15:18:40.591760
33The National Park Service warns against sacrificing slower friends in a bear attackNone2022-03-02 15:18:40.591760
44Maine man wins $1M from $25 lottery ticketNone2022-03-02 15:18:40.591760
55Make huge profits without work, earn up to $100,000 a dayNone2022-03-02 15:18:40.591760

documents stores additional text fields as JSON, objects stores binary content and sections stores indexed text. The only table with data as of now is sections. sections stores the input (id, text, tags) elements along with internal ids and entry dates.

We'll come back to documents and objects.

Embeddings

Embeddings is the ANN index and what is queried when running similarity search. The default setting is to use Faiss. Let's inspect!

import faiss
import numpy as np

# Query
query = "feel good story"

# Read index
index = faiss.read_index("index/embeddings")
print(index)
print(f"Total records: {index.ntotal}, dimensions: {index.d}")
print()

# Generate query embeddings and run query
queries = np.array([embeddings.transform((None, query, None))])
scores, ids = index.search(queries, 1)

# Lookup query result from original data array
result = data[ids[0][0]]

# Show results
print("Query:", query)
print("Results:", result, ids, scores)
<faiss.swigfaiss.IndexIDMap; proxy of <Swig Object of type 'faiss::IndexIDMapTemplate< faiss::Index > *' at 0x7f68631cd750> >
Total records: 6, dimensions: 768

Query: feel good story
Results: Maine man wins $1M from $25 lottery ticket [[4]] [[0.08329004]]

Index compression

txtai normally saves index files to a directory. Indexes can also be compressed. Nothing is different other than the files being in an compressed file format vs a directory.

# Save index as tar.xz
embeddings.save("index.tar.xz")
!tar -tvJf index.tar.xz
!echo
!xz -l index.tar.xz
!echo

# Reload index
embeddings.load("index.tar.xz")

# Test search matches
embeddings.search("feel good story", 1)
drwx------ root/root         0 2022-03-02 15:18 ./
-rw-r--r-- root/root       295 2022-03-02 15:18 ./config
-rw-r--r-- root/root     28672 2022-03-02 15:18 ./documents
-rw-r--r-- root/root     18570 2022-03-02 15:18 ./embeddings

Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
    1       1     18.1 KiB     50.0 KiB  0.361  CRC64   index.tar.xz

[{'id': '4',
  'score': 0.08329004049301147,
  'text': 'Maine man wins $1M from $25 lottery ticket'}]

Content storage

Let's add additional metadata and binary content to the index and see how that is stored in the SQLite database.

import urllib

from IPython.display import Image

# Get an image
request = urllib.request.urlopen("https://raw.githubusercontent.com/neuml/txtai/master/demo.gif")

# Get data
data = request.read()

# Upsert new record having both text and an object
embeddings.upsert([("txtai", {"text": "txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.", "size": len(data), "object": data}, None)])

embeddings.save("index")
showdb("index/documents")

Tables

name
documents
objects
sections

documents

iddatatagsentry
txtai{"text": "txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.", "size": 47189}None2022-03-02 15:19:00.708223

objects

idobjecttagsentry
txtaib'GIF89a\x9b\x04\x18\x03\xf5\x00\x00\x12\x13\x14\xcc\xcc\xcc\x13\x14\x15\xbd\xbd\xbd'None2022-03-02 15:19:00.708223

sections

indexididtexttagsentry
00US tops 5 million confirmed virus casesNone2022-03-02 15:18:40.591760
11Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized icebergNone2022-03-02 15:18:40.591760
22Beijing mobilises invasion craft along coast as Taiwan tensions escalateNone2022-03-02 15:18:40.591760
33The National Park Service warns against sacrificing slower friends in a bear attackNone2022-03-02 15:18:40.591760
44Maine man wins $1M from $25 lottery ticketNone2022-03-02 15:18:40.591760
55Make huge profits without work, earn up to $100,000 a dayNone2022-03-02 15:18:40.591760
6txtaitxtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.None2022-03-02 15:19:00.708223

This section added a new record with metadata and binary content (truncated when printed here). The documents table enables additional fielded search with SQL.

embeddings.search("select * from txtai where size > 0")
[{'data': '{"text": "txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.", "size": 47189}',
  'entry': '2022-03-02 15:19:00.708223',
  'id': 'txtai',
  'indexid': 6,
  'object': <_io.BytesIO at 0x7f6861408a70>,
  'score': None,
  'tags': None,
  'text': 'txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.'}]

Metadata fields can also be selected and combined with similarity queries.

embeddings.search("select text, size, score from txtai where similar('machine learning') and score > 0.25 and size > 0")
[{'score': 0.5479326844215393,
  'size': 47189,
  'text': 'txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.'}]

The objects table enables additional binary content to be stored alongside an embeddings index. In some cases (image search), the object content is used to build embeddings.

Otherwise, it's the text field from sections. In both cases, associated binary objects are available at search time.

embeddings.search("select object from txtai where object is not null")
[{'object': <_io.BytesIO at 0x7f6863246470>}]

Wrapping up

This article gave an overview of the txtai embeddings index file format. This hopefully gives a basic understanding of the architecture and/or helps with debugging when running into issues.

See the following links for more information.