Embeddings index format for open data access

Platform and programming language independent data storage with txtai

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

txtai is primarily developed in Python, but a key tenet is that the underlying data in an embeddings index is accessible without txtai.

This article will demonstrate this through a series of examples.

Install dependencies

Install txtai and all dependencies.

pip install txtai[graph] datasets sqlite-vec

Load dataset

This example will use the chatgpt-prompts dataset.

from datasets import load_dataset

dataset = load_dataset("fka/awesome-chatgpt-prompts", split="train")

Build an Embeddings index

Let's first build an embeddings index using txtai.

from txtai import Embeddings

embeddings = Embeddings()
embeddings.index((x["act"], x["prompt"]) for x in dataset)
embeddings.save("txtai-index")

Let's take a look at the index that was created.

!ls -l txtai-index
!echo
!file txtai-index/*
total 268
-rw-r--r-- 1 root root    342 Sep  6 15:21 config.json
-rw-r--r-- 1 root root 262570 Sep  6 15:21 embeddings
-rw-r--r-- 1 root root   2988 Sep  6 15:21 ids

txtai-index/config.json: JSON data
txtai-index/embeddings:  data
txtai-index/ids:         data

The txtai embeddings index format is documented here. Looking at the files above, we have configuration, embeddings data and ids storage. Ids storage is only used when content is disabled.

Let's inspect each file.

import json

with open("txtai-index/config.json") as f:
  print(json.dumps(json.load(f), sort_keys=True, indent=2))
{
  "backend": "faiss",
  "build": {
    "create": "2024-09-06T15:21:11Z",
    "python": "3.10.12",
    "settings": {
      "components": "IDMap,Flat"
    },
    "system": "Linux (x86_64)",
    "txtai": "7.5.0"
  },
  "dimensions": 384,
  "offset": 170,
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "update": "2024-09-06T15:21:11Z"
}
import faiss

index = faiss.read_index("txtai-index/embeddings")
print(f"Total records {index.ntotal}")

Total records 170

import msgpack

with open("txtai-index/ids", "rb") as f:
  print(msgpack.unpack(f)[5:10])
['JavaScript Console', 'Excel Sheet', 'English Pronunciation Helper', 'Spoken English Teacher and Improver', 'Travel Guide']

Each file can be read without txtai. JSON, MessagePack and Faiss all have libraries in multiple programming languages.

Embeddings index with SQLite

In the next example, we'll use SQLite to store content and vectors courtesy of the sqlite-vec library.

from txtai import Embeddings

embeddings = Embeddings(content=True, backend="sqlite")
embeddings.index((x["act"], x["prompt"]) for x in dataset)
embeddings.save("txtai-sqlite")

Let's once again explore the generated index files.

!ls -l txtai-sqlite
!echo
!file txtai-sqlite/*
total 1696
-rw-r--r-- 1 root root     384 Sep  6 15:21 config.json
-rw-r--r-- 1 root root  126976 Sep  6 15:21 documents
-rw-r--r-- 1 root root 1605632 Sep  6 15:21 embeddings

txtai-sqlite/config.json: JSON data
txtai-sqlite/documents:   SQLite 3.x database, last written using SQLite version 3037002, file counter 1, database pages 31, cookie 0x1, schema 4, UTF-8, version-valid-for 1
txtai-sqlite/embeddings:  SQLite 3.x database, last written using SQLite version 3037002, file counter 1, database pages 392, cookie 0x1, schema 4, UTF-8, version-valid-for 1

This time, note that there is a documents file storing content in SQLite, along with a separate SQLite file for the embeddings vectors. Let's test it out.

embeddings.search("teacher")
[{'id': 'Math Teacher',
  'text': 'I want you to act as a math teacher. I will provide some mathematical equations or concepts, and it will be your job to explain them in easy-to-understand terms. This could include providing step-by-step instructions for solving a problem, demonstrating various techniques with visuals or suggesting online resources for further study. My first request is "I need help understanding how probability works."',
  'score': 0.3421396017074585},
 {'id': 'Educational Content Creator',
  'text': 'I want you to act as an educational content creator. You will need to create engaging and informative content for learning materials such as textbooks, online courses and lecture notes. My first suggestion request is "I need help developing a lesson plan on renewable energy sources for high school students."',
  'score': 0.3267676830291748},
 {'id': 'Philosophy Teacher',
  'text': 'I want you to act as a philosophy teacher. I will provide some topics related to the study of philosophy, and it will be your job to explain these concepts in an easy-to-understand manner. This could include providing examples, posing questions or breaking down complex ideas into smaller pieces that are easier to comprehend. My first request is "I need help understanding how different philosophical theories can be applied in everyday life."',
  'score': 0.30780404806137085}]

The top results are returned as expected. Let's again inspect the files.

import json

with open("txtai-sqlite/config.json") as f:
  print(json.dumps(json.load(f), sort_keys=True, indent=2))
{
  "backend": "sqlite",
  "build": {
    "create": "2024-09-06T15:21:13Z",
    "python": "3.10.12",
    "settings": {
      "sqlite": "3.37.2",
      "sqlite-vec": "v0.1.1"
    },
    "system": "Linux (x86_64)",
    "txtai": "7.5.0"
  },
  "content": true,
  "dimensions": 384,
  "offset": 170,
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "update": "2024-09-06T15:21:13Z"
}
import sqlite3, sqlite_vec

db = sqlite3.connect("txtai-sqlite/documents")
print(db.execute("SELECT COUNT(*) FROM sections").fetchone()[0])

db = sqlite3.connect("txtai-sqlite/embeddings")
db.enable_load_extension(True)
sqlite_vec.load(db)
print(db.execute("SELECT COUNT(*) FROM vectors").fetchone()[0])

170
170

As in the previous example, each file can be read without txtai. JSON, SQLite and sqlite-vec all have libraries in multiple programming languages.

Graph storage

Starting with txtai 7.4, graphs are stored using MessagePack. The saved graph file contains a list of nodes and edges that can easily be imported.

from txtai import Embeddings

embeddings = Embeddings(content=True, backend="sqlite", graph={"approximate": False})
embeddings.index((x["act"], x["prompt"]) for x in dataset)
embeddings.save("txtai-graph")
!ls -l txtai-graph
!echo
!file txtai-graph/*
total 1816
-rw-r--r-- 1 root root     454 Sep  6 15:21 config.json
-rw-r--r-- 1 root root  126976 Sep  6 15:21 documents
-rw-r--r-- 1 root root 1605632 Sep  6 15:21 embeddings
-rw-r--r-- 1 root root  119970 Sep  6 15:21 graph

txtai-graph/config.json: JSON data
txtai-graph/documents:   SQLite 3.x database, last written using SQLite version 3037002, file counter 1, database pages 31, cookie 0x1, schema 4, UTF-8, version-valid-for 1
txtai-graph/embeddings:  SQLite 3.x database, last written using SQLite version 3037002, file counter 1, database pages 392, cookie 0x1, schema 4, UTF-8, version-valid-for 1
txtai-graph/graph:       data
import msgpack

with open("txtai-graph/graph", "rb") as f:
  data = msgpack.unpack(f)
  print(data.keys())

  for key in data:
    if data[key]:
      print(key, data[key][100])
dict_keys(['nodes', 'edges', 'categories', 'topics'])
nodes [100, {'id': 'Ascii Artist', 'text': 'I want you to act as an ascii artist. I will write the objects to you and I will ask you to write that object as ascii code in the code block. Write only ascii code. Do not explain about the object you wrote. I will say the objects in double quotes. My first object is "cat"'}]
edges [5, 100, {'weight': 0.39010339975357056}]

Wrapping up

This article gave an overview of the txtai embeddings index file format and how it supports open data access.

While txtai can be used as an all-in-one embeddings database, it can also be used for only one part of the stack such as data ingestion. For example, it can be used to populate a Postgres or SQLite database for downstream use. The options are there.