Embeddings index format for open data access
Platform and programming language independent data storage with txtai
txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.
The main programming language for txtai is Python. A key tenet, though, is that the underlying data in an embeddings index is accessible without txtai itself.
This article will demonstrate this through a series of examples.
Install dependencies
Install txtai and all dependencies.
pip install txtai[graph] datasets sqlite-vec
Load dataset
This example will use the chatgpt-prompts dataset.
from datasets import load_dataset
dataset = load_dataset("fka/awesome-chatgpt-prompts", split="train")
Build an Embeddings index
Let's first build an embeddings index using txtai.
from txtai import Embeddings
embeddings = Embeddings()
embeddings.index((x["act"], x["prompt"]) for x in dataset)
embeddings.save("txtai-index")
Let's take a look at the index that was created.
!ls -l txtai-index
!echo
!file txtai-index/*
total 268
-rw-r--r-- 1 root root 342 Sep 6 15:21 config.json
-rw-r--r-- 1 root root 262570 Sep 6 15:21 embeddings
-rw-r--r-- 1 root root 2988 Sep 6 15:21 ids
txtai-index/config.json: JSON data
txtai-index/embeddings: data
txtai-index/ids: data
The txtai embeddings index format is documented here. Looking at the files above, we have configuration, embeddings data and ids storage. The ids file is only present when content storage is disabled.
Let's inspect each file.
import json
with open("txtai-index/config.json") as f:
print(json.dumps(json.load(f), sort_keys=True, indent=2))
{
  "backend": "faiss",
  "build": {
    "create": "2024-09-06T15:21:11Z",
    "python": "3.10.12",
    "settings": {
      "components": "IDMap,Flat"
    },
    "system": "Linux (x86_64)",
    "txtai": "7.5.0"
  },
  "dimensions": 384,
  "offset": 170,
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "update": "2024-09-06T15:21:11Z"
}
import faiss
index = faiss.read_index("txtai-index/embeddings")
print(f"Total records {index.ntotal}")
Total records 170
import msgpack
with open("txtai-index/ids", "rb") as f:
print(msgpack.unpack(f)[5:10])
['JavaScript Console', 'Excel Sheet', 'English Pronunciation Helper', 'Spoken English Teacher and Improver', 'Travel Guide']
Each file can be read without txtai. JSON, MessagePack and Faiss all have libraries in multiple programming languages.
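To make that concrete, here is a minimal sketch of running a query directly against these raw files, with no txtai in the loop. It assumes sentence-transformers is installed and encodes the query with the model listed in config.json, normalizing the vector to match how the index was built. The Faiss result positions map back to entries in the ids file.

import faiss
import msgpack
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the raw index artifacts directly - no txtai required
index = faiss.read_index("txtai-index/embeddings")
with open("txtai-index/ids", "rb") as f:
    ids = msgpack.unpack(f)

# Encode the query with the model listed in config.json ("path"),
# normalizing so scores behave like cosine similarity
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query = model.encode(["teacher"], normalize_embeddings=True).astype(np.float32)

# Nearest neighbor search - positions map back to entries in the ids file
scores, positions = index.search(query, 3)
print([(ids[p], float(s)) for p, s in zip(positions[0], scores[0])])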
Embeddings index with SQLite
In the next example, we'll use SQLite to store content and vectors courtesy of the sqlite-vec library.
from txtai import Embeddings
embeddings = Embeddings(content=True, backend="sqlite")
embeddings.index((x["act"], x["prompt"]) for x in dataset)
embeddings.save("txtai-sqlite")
Let's once again explore the generated index files.
!ls -l txtai-sqlite
!echo
!file txtai-sqlite/*
total 1696
-rw-r--r-- 1 root root 384 Sep 6 15:21 config.json
-rw-r--r-- 1 root root 126976 Sep 6 15:21 documents
-rw-r--r-- 1 root root 1605632 Sep 6 15:21 embeddings
txtai-sqlite/config.json: JSON data
txtai-sqlite/documents: SQLite 3.x database, last written using SQLite version 3037002, file counter 1, database pages 31, cookie 0x1, schema 4, UTF-8, version-valid-for 1
txtai-sqlite/embeddings: SQLite 3.x database, last written using SQLite version 3037002, file counter 1, database pages 392, cookie 0x1, schema 4, UTF-8, version-valid-for 1
This time, note that there is a documents file with content stored in SQLite and a separate SQLite database holding the embeddings vectors. Let's test it out.
embeddings.search("teacher")
[{'id': 'Math Teacher',
'text': 'I want you to act as a math teacher. I will provide some mathematical equations or concepts, and it will be your job to explain them in easy-to-understand terms. This could include providing step-by-step instructions for solving a problem, demonstrating various techniques with visuals or suggesting online resources for further study. My first request is "I need help understanding how probability works."',
'score': 0.3421396017074585},
{'id': 'Educational Content Creator',
'text': 'I want you to act as an educational content creator. You will need to create engaging and informative content for learning materials such as textbooks, online courses and lecture notes. My first suggestion request is "I need help developing a lesson plan on renewable energy sources for high school students."',
'score': 0.3267676830291748},
{'id': 'Philosophy Teacher',
'text': 'I want you to act as a philosophy teacher. I will provide some topics related to the study of philosophy, and it will be your job to explain these concepts in an easy-to-understand manner. This could include providing examples, posing questions or breaking down complex ideas into smaller pieces that are easier to comprehend. My first request is "I need help understanding how different philosophical theories can be applied in everyday life."',
'score': 0.30780404806137085}]
The top results are as expected. Let's again inspect the files.
import json
with open("txtai-sqlite/config.json") as f:
print(json.dumps(json.load(f), sort_keys=True, indent=2))
{
  "backend": "sqlite",
  "build": {
    "create": "2024-09-06T15:21:13Z",
    "python": "3.10.12",
    "settings": {
      "sqlite": "3.37.2",
      "sqlite-vec": "v0.1.1"
    },
    "system": "Linux (x86_64)",
    "txtai": "7.5.0"
  },
  "content": true,
  "dimensions": 384,
  "offset": 170,
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "update": "2024-09-06T15:21:13Z"
}
import sqlite3, sqlite_vec
db = sqlite3.connect("txtai-sqlite/documents")
print(db.execute("SELECT COUNT(*) FROM sections").fetchone()[0])
db = sqlite3.connect("txtai-sqlite/embeddings")
db.enable_load_extension(True)
sqlite_vec.load(db)
print(db.execute("SELECT COUNT(*) FROM vectors").fetchone()[0])
170
170
As in the previous example, each file can be read without txtai. JSON, SQLite and sqlite-vec all have libraries in multiple programming languages.
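As a quick sketch, the stored content can also be pulled straight out of the documents database with plain SQLite. The column names used below (id, text) are assumptions; the schema query confirms the actual layout. The embeddings database can be queried the same way once the sqlite-vec extension is loaded, as shown above.

import sqlite3

# Read stored content directly from the documents database - no txtai needed
db = sqlite3.connect("txtai-sqlite/documents")

# Inspect the schema first to confirm the table layout
print(db.execute("SELECT sql FROM sqlite_master WHERE name = 'sections'").fetchone()[0])

# Pull a few stored rows (assumes id and text columns, per the schema above)
for uid, text in db.execute("SELECT id, text FROM sections LIMIT 3"):
    print(uid, text[:80])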
Graph storage
Starting with txtai 7.4, graphs are stored using MessagePack. The index file has a list of nodes and edges that can easily be imported into other tools, as shown in the NetworkX sketch below.
from txtai import Embeddings
embeddings = Embeddings(content=True, backend="sqlite", graph={"approximate": False})
embeddings.index((x["act"], x["prompt"]) for x in dataset)
embeddings.save("txtai-graph")
!ls -l txtai-graph
!echo
!file txtai-graph/*
total 1816
-rw-r--r-- 1 root root 454 Sep 6 15:21 config.json
-rw-r--r-- 1 root root 126976 Sep 6 15:21 documents
-rw-r--r-- 1 root root 1605632 Sep 6 15:21 embeddings
-rw-r--r-- 1 root root 119970 Sep 6 15:21 graph
txtai-graph/config.json: JSON data
txtai-graph/documents: SQLite 3.x database, last written using SQLite version 3037002, file counter 1, database pages 31, cookie 0x1, schema 4, UTF-8, version-valid-for 1
txtai-graph/embeddings: SQLite 3.x database, last written using SQLite version 3037002, file counter 1, database pages 392, cookie 0x1, schema 4, UTF-8, version-valid-for 1
txtai-graph/graph: data
import msgpack
with open("txtai-graph/graph", "rb") as f:
data = msgpack.unpack(f)
print(data.keys())
for key in data:
if data[key]:
print(key, data[key][100])
dict_keys(['nodes', 'edges', 'categories', 'topics'])
nodes [100, {'id': 'Ascii Artist', 'text': 'I want you to act as an ascii artist. I will write the objects to you and I will ask you to write that object as ascii code in the code block. Write only ascii code. Do not explain about the object you wrote. I will say the objects in double quotes. My first object is "cat"'}]
edges [5, 100, {'weight': 0.39010339975357056}]
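Since the graph file is just nodes and edges, it can be rebuilt in other libraries. Here is a minimal sketch that imports it into NetworkX (assuming networkx is installed); each node entry is a [node, attributes] pair and each edge entry is [source, target, attributes], per the output above.

import msgpack
import networkx as nx

# Load the raw graph file
with open("txtai-graph/graph", "rb") as f:
    data = msgpack.unpack(f)

# Rebuild the graph from the stored nodes and edges
graph = nx.Graph()
for node, attributes in data["nodes"]:
    graph.add_node(node, **attributes)
for source, target, attributes in data["edges"]:
    graph.add_edge(source, target, **attributes)

print(graph.number_of_nodes(), graph.number_of_edges())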
Wrapping up
This article gave an overview of the txtai embeddings index file format and how it supports open data access.
While txtai can be used as an all-in-one embeddings database, it can also be used for only one part of the stack such as data ingestion. For example, it can be used to populate a Postgres or SQLite database for downstream use. The options are there.