Anatomy of a txtai index
Deep dive into the file formats behind a txtai embeddings index
This article inspects the filesystem of a txtai embeddings index and gives an overview of the structure.
Install dependencies
Install txtai
and all dependencies.
pip install txtai
Create index
Let's first create an index to inspect. We'll use the classic txtai example.
from txtai.embeddings import Embeddings
data = ["US tops 5 million confirmed virus cases",
"Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
"Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
"The National Park Service warns against sacrificing slower friends in a bear attack",
"Maine man wins $1M from $25 lottery ticket",
"Make huge profits without work, earn up to $100,000 a day"]
# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", "content": True, "objects": True})
# Create an index for the list of text
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])
# Run a search
embeddings.search("feel good story", 1)
[{'id': '4',
'score': 0.08329004049301147,
'text': 'Maine man wins $1M from $25 lottery ticket'}]
Print index info
Embeddings indexes have an info
method which prints metadata about the index. This can be used to see when the index was build, what settings were used and when it was last updated.
# Print metadata
embeddings.info()
{
"backend": "faiss",
"build": {
"create": "2022-03-02T15:18:41Z",
"python": "3.7.12",
"settings": {
"components": "IDMap,Flat"
},
"system": "Linux (x86_64)",
"txtai": "4.3.0"
},
"content": "sqlite",
"dimensions": 768,
"objects": true,
"offset": 6,
"path": "sentence-transformers/nli-mpnet-base-v2",
"update": "2022-03-02T15:18:41Z"
}
Save index and review file structure
Next let's save the index and review the file structure. This section prints each file, and runs commands to show
# Save the index
embeddings.save("index")
# Show basic details about index files
for f in ["config", "documents", "embeddings"]:
!ls -l "index/{f}"
!xxd "index/{f}" | head -5
!file "index/{f}"
!echo
-rw-r--r-- 1 root root 295 Mar 2 15:18 index/config
00000000: 8004 951c 0100 0000 0000 007d 9428 8c04 ...........}.(..
00000010: 7061 7468 948c 2773 656e 7465 6e63 652d path..'sentence-
00000020: 7472 616e 7366 6f72 6d65 7273 2f6e 6c69 transformers/nli
00000030: 2d6d 706e 6574 2d62 6173 652d 7632 948c -mpnet-base-v2..
00000040: 0763 6f6e 7465 6e74 948c 0673 716c 6974 .content...sqlit
index/config: data
-rw-r--r-- 1 root root 28672 Mar 2 15:18 index/documents
00000000: 5351 4c69 7465 2066 6f72 6d61 7420 3300 SQLite format 3.
00000010: 1000 0101 0040 2020 0000 0001 0000 0007 .....@ ........
00000020: 0000 0000 0000 0000 0000 0001 0000 0004 ................
00000030: 0000 0000 0000 0000 0000 0001 0000 0000 ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
index/documents: SQLite 3.x database, last written using SQLite version 3022000
-rw-r--r-- 1 root root 18570 Mar 2 15:18 index/embeddings
00000000: 4978 4d70 0003 0000 0600 0000 0000 0000 IxMp............
00000010: 0000 1000 0000 0000 0000 1000 0000 0000 ................
00000020: 0100 0000 0049 7846 4900 0300 0006 0000 .....IxFI.......
00000030: 0000 0000 0000 0010 0000 0000 0000 0010 ................
00000040: 0000 0000 0001 0000 0000 0012 0000 0000 ................
index/embeddings: data
The directory has three files: config, documents and embeddings.
config - The input configuration passed into the Embeddings object. Serialized with Python's pickle format.
documents - SQLite database. Stores the input text content and associated data.
embeddings - The embeddings index file. This is an Approximate Nearest Neighbor (ANN) index with either Faiss (default), Hnswlib or Annoy, depending on the settings.
Config
Given that the configuration file is serialized with Python pickle, it can be loaded in Python.
import json
import pickle
with open("index/config", "rb") as config:
print(json.dumps(pickle.load(config), sort_keys=True, indent=2))
{
"backend": "faiss",
"build": {
"create": "2022-03-02T15:18:41Z",
"python": "3.7.12",
"settings": {
"components": "IDMap,Flat"
},
"system": "Linux (x86_64)",
"txtai": "4.3.0"
},
"content": "sqlite",
"dimensions": 768,
"objects": true,
"offset": 6,
"path": "sentence-transformers/nli-mpnet-base-v2",
"update": "2022-03-02T15:18:41Z"
}
Notice how this is the same output as embeddings.info()
.
Documents
The documents file is a SQLite database with three tables, documents, objects and sections. Let's take a look inside.
import pandas as pd
import sqlite3
from IPython.display import display, Markdown
# Print details of a txtai SQLite document database
def showdb(path):
db = sqlite3.connect(path)
display(Markdown("## Tables"))
df = pd.read_sql_query("select name FROM sqlite_master where type='table'", db)
display(df.style.hide_index())
for table in df["name"]:
display(Markdown(f"## {table}"))
df = pd.read_sql_query(f"select * from {table}", db)
# Truncate large binary objects
if "object" in df:
df["object"] = df["object"].str.slice(0, 25)
display(df.style.hide_index())
showdb("index/documents")
Tables
name |
documents |
objects |
sections |
documents
id | data | tags | entry |
objects
id | object | tags | entry |
sections
indexid | id | text | tags | entry |
0 | 0 | US tops 5 million confirmed virus cases | None | 2022-03-02 15:18:40.591760 |
1 | 1 | Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg | None | 2022-03-02 15:18:40.591760 |
2 | 2 | Beijing mobilises invasion craft along coast as Taiwan tensions escalate | None | 2022-03-02 15:18:40.591760 |
3 | 3 | The National Park Service warns against sacrificing slower friends in a bear attack | None | 2022-03-02 15:18:40.591760 |
4 | 4 | Maine man wins $1M from $25 lottery ticket | None | 2022-03-02 15:18:40.591760 |
5 | 5 | Make huge profits without work, earn up to $100,000 a day | None | 2022-03-02 15:18:40.591760 |
documents
stores additional text fields as JSON, objects
stores binary content and sections
stores indexed text. The only table with data as of now is sections
. sections
stores the input (id, text, tags) elements along with internal ids and entry dates.
We'll come back to documents
and objects
.
Embeddings
Embeddings is the ANN index and what is queried when running similarity search. The default setting is to use Faiss. Let's inspect!
import faiss
import numpy as np
# Query
query = "feel good story"
# Read index
index = faiss.read_index("index/embeddings")
print(index)
print(f"Total records: {index.ntotal}, dimensions: {index.d}")
print()
# Generate query embeddings and run query
queries = np.array([embeddings.transform((None, query, None))])
scores, ids = index.search(queries, 1)
# Lookup query result from original data array
result = data[ids[0][0]]
# Show results
print("Query:", query)
print("Results:", result, ids, scores)
<faiss.swigfaiss.IndexIDMap; proxy of <Swig Object of type 'faiss::IndexIDMapTemplate< faiss::Index > *' at 0x7f68631cd750> >
Total records: 6, dimensions: 768
Query: feel good story
Results: Maine man wins $1M from $25 lottery ticket [[4]] [[0.08329004]]
Index compression
txtai normally saves index files to a directory. Indexes can also be compressed. Nothing is different other than the files being in an compressed file format vs a directory.
# Save index as tar.xz
embeddings.save("index.tar.xz")
!tar -tvJf index.tar.xz
!echo
!xz -l index.tar.xz
!echo
# Reload index
embeddings.load("index.tar.xz")
# Test search matches
embeddings.search("feel good story", 1)
drwx------ root/root 0 2022-03-02 15:18 ./
-rw-r--r-- root/root 295 2022-03-02 15:18 ./config
-rw-r--r-- root/root 28672 2022-03-02 15:18 ./documents
-rw-r--r-- root/root 18570 2022-03-02 15:18 ./embeddings
Strms Blocks Compressed Uncompressed Ratio Check Filename
1 1 18.1 KiB 50.0 KiB 0.361 CRC64 index.tar.xz
[{'id': '4',
'score': 0.08329004049301147,
'text': 'Maine man wins $1M from $25 lottery ticket'}]
Content storage
Let's add additional metadata and binary content to the index and see how that is stored in the SQLite database.
import urllib
from IPython.display import Image
# Get an image
request = urllib.request.urlopen("https://raw.githubusercontent.com/neuml/txtai/master/demo.gif")
# Get data
data = request.read()
# Upsert new record having both text and an object
embeddings.upsert([("txtai", {"text": "txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.", "size": len(data), "object": data}, None)])
embeddings.save("index")
showdb("index/documents")
Tables
name |
documents |
objects |
sections |
documents
id | data | tags | entry |
txtai | {"text": "txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.", "size": 47189} | None | 2022-03-02 15:19:00.708223 |
objects
id | object | tags | entry |
txtai | b'GIF89a\x9b\x04\x18\x03\xf5\x00\x00\x12\x13\x14\xcc\xcc\xcc\x13\x14\x15\xbd\xbd\xbd' | None | 2022-03-02 15:19:00.708223 |
sections
indexid | id | text | tags | entry |
0 | 0 | US tops 5 million confirmed virus cases | None | 2022-03-02 15:18:40.591760 |
1 | 1 | Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg | None | 2022-03-02 15:18:40.591760 |
2 | 2 | Beijing mobilises invasion craft along coast as Taiwan tensions escalate | None | 2022-03-02 15:18:40.591760 |
3 | 3 | The National Park Service warns against sacrificing slower friends in a bear attack | None | 2022-03-02 15:18:40.591760 |
4 | 4 | Maine man wins $1M from $25 lottery ticket | None | 2022-03-02 15:18:40.591760 |
5 | 5 | Make huge profits without work, earn up to $100,000 a day | None | 2022-03-02 15:18:40.591760 |
6 | txtai | txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications. | None | 2022-03-02 15:19:00.708223 |
This section added a new record with metadata and binary content (truncated when printed here). The documents
table enables additional fielded search with SQL.
embeddings.search("select * from txtai where size > 0")
[{'data': '{"text": "txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.", "size": 47189}',
'entry': '2022-03-02 15:19:00.708223',
'id': 'txtai',
'indexid': 6,
'object': <_io.BytesIO at 0x7f6861408a70>,
'score': None,
'tags': None,
'text': 'txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.'}]
Metadata fields can also be selected and combined with similarity queries.
embeddings.search("select text, size, score from txtai where similar('machine learning') and score > 0.25 and size > 0")
[{'score': 0.5479326844215393,
'size': 47189,
'text': 'txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.'}]
The objects
table enables additional binary content to be stored alongside an embeddings index. In some cases (image search), the object content is used to build embeddings.
Otherwise, it's the text field from sections. In both cases, associated binary objects are available at search time.
embeddings.search("select object from txtai where object is not null")
[{'object': <_io.BytesIO at 0x7f6863246470>}]
Wrapping up
This article gave an overview of the txtai embeddings index file format. This hopefully gives a basic understanding of the architecture and/or helps with debugging when running into issues.
See the following links for more information.