Integrate txtai with Postgres

Integrate txtai with Postgres

Persist content, vectors and graph data in Postgres

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

The default persistence methods for txtai are local and file-based. SQLite for content, Faiss for vectors and NetworkX for graph data. The main value add of txtai is getting up and running quickly with a minimal amount of external dependencies.

Another key feature of txtai is being able to quickly move from prototyping to production. This article will demonstrate how txtai can integrate with Postgres, a powerful, production-ready and open source object-relational database system. After txtai persists content to Postgres, we'll show it can be directly queried with SQL from any Postgres client

Install dependencies

Install txtai and all dependencies.

# Install txtai
pip install txtai[ann,database,graph]

Install Postgres

Next, we'll install Postgres and start a Postgres instance. This will also install the pgvector extension to enable vector search and storage.

Note: With local environments, consider running Postgres as a Docker container.

# Install Postgres and pgvector
!apt-get update && apt install postgresql postgresql-server-dev-14
!git clone --branch v0.6.2 https://github.com/pgvector/pgvector.git
!cd pgvector && make && make install

# Start database
!service postgresql start
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'pass';"

Build an Embeddings database

First, we'll load the 100K most popular articles from Wikipedia. This example will store vectors and content in Postgres.

from txtai import Embeddings

# URL set in code for demo purposes. Use environment variables in production.
url = "postgresql+psycopg2://postgres:pass@localhost/postgres"

# Create embeddings
embeddings = Embeddings(
    content=url,
    backend="pgvector",
    pgvector={
        "url": url
    }
)

# Load dataset
wikipedia = Embeddings()
wikipedia.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

query = """
SELECT id, text FROM txtai
order by percentile desc
LIMIT 100000
"""

# Index dataset
embeddings.index(wikipedia.search(query))
embeddings.search("Tell me about a mythical horse", 1)
[{'id': 'Unicorn',
  'text': 'The unicorn is a legendary creature that has been described since antiquity as a beast with a single large, pointed, spiraling horn projecting from its forehead.',
  'score': 0.6493861675262451}]
embeddings.search("What is the main ingredient in Ketchup?", 1)
[{'id': 'Ketchup',
  'text': 'Ketchup or catsup is a table condiment with a sweet and sour flavor. The unmodified term ("ketchup") now typically refers to tomato ketchup, although early recipes for various different varieties of ketchup contained mushrooms, oysters, mussels, egg whites, grapes or walnuts, among other ingredients. ',
  'score': 0.6998806595802307}]

Now, let's explore how the data is stored in further detail. We'll save the database and take a look.

%env TOKENIZERS_PARALLELISM=false

# Commit results to the database
embeddings.save("test")
embeddings.close()

!ls test
!cat test/config.json
config.json
{
  "content": "postgresql+psycopg2://postgres:pass@localhost/postgres",
  "backend": "pgvector",
  "pgvector": {
    "url": "postgresql+psycopg2://postgres:pass@localhost/postgres"
  },
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "dimensions": 384,
  "offset": 100000
}

Note how there is only a configuration file with the database details. Let's explore the Postgres database.

!PGPASSWORD=pass psql -h localhost -U postgres -c "SELECT * from sections order by indexid LIMIT 1"
!PGPASSWORD=pass psql -h localhost -U postgres -c "SELECT * from vectors order by indexid LIMIT 1"
 indexid |    id     |                    text                     | tags |           entry            
---------+-----------+---------------------------------------------+------+----------------------------
       0 | Main Page | Welcome to Wikipedia,                      +|      | 2024-04-24 17:02:55.335643
         |           | the free encyclopedia that anyone can edit.+|      | 
         |           |  articles in English                        |      | 
(1 row)

indexid  | embedding
---------+----------------------------------------------------------------------------------------------
       0 | [0.006224656,0.010993197,-0.020153109,0.046154346,-0.0067520724,0.011807963,-0.014358369, ...
(1 row)

Note how the text content and vectors are stored in this Postgres instance.

Graph Embeddings

Next, we'll rebuild the same embeddings database and enable a graph component. This graph component will also persist content to Postgres.

from txtai import Embeddings

# Create embeddings
embeddings = Embeddings(
    content=url,
    backend="pgvector",
    pgvector={
        "url": url
    },
    graph={
        "backend": "rdbms",
        "url": url,
        "approximate": False,
    }
)

# Index dataset
embeddings.index(wikipedia.search(query))

Now, as with prior graph examples, let's build a new graph with a query. Then we'll plot that subgraph.

g = embeddings.graph.search("""
MATCH P=({id: "Roman Empire"})-[*1..3]->({id: "Saxons"})-[*1..3]->({id: "Vikings"})-[*1..3]->({id: "Battle of Hastings"})
RETURN P
LIMIT 20
""", graph=True)
import matplotlib.pyplot as plt
import networkx as nx

def plot(graph):
    labels = {x: f"{graph.attribute(x, 'id')}" for x in graph.scan()}
    colors = ["#D32F2F", "#0277bd", "#7e57c2", "#757575"]

    results = embeddings.batchsimilarity(labels.values(), ["Roman Empire", "Germanic Barbarians", "Viking conquest and siege", "Norman Conquest of England"])
    colors = [colors[x[0][0]] for x in results]

    options = {
        "node_size": 2000,
        "node_color": colors,
        "edge_color": "#454545",
        "font_color": "#efefef",
        "font_size": 11,
        "alpha": 1.0,
    }

    fig, ax = plt.subplots(figsize=(20, 9))
    pos = nx.spring_layout(graph.backend, seed=512, k=0.9, iterations=50)
    nx.draw_networkx(graph.backend, pos=pos, labels=labels, **options)
    ax.set_facecolor("#303030")
    ax.axis("off")
    fig.set_facecolor("#303030")

    plt.show()

plot(g)

🚀 Very exciting. A full graph embeddings database all backed by Postgres!

Let's explore the database once again.

embeddings.save("test")
!PGPASSWORD=pass psql -h localhost -U postgres -c "SELECT * from sections order by indexid LIMIT 1"
!PGPASSWORD=pass psql -h localhost -U postgres -c "SELECT * from vectors order by indexid LIMIT 1"
!PGPASSWORD=pass psql -h localhost -U postgres -c "SELECT * from nodes LIMIT 1"
!PGPASSWORD=pass psql -h localhost -U postgres -c "SELECT * from edges LIMIT 1"
 indexid |    id     |                    text                     | tags |           entry            
---------+-----------+---------------------------------------------+------+----------------------------
       0 | Main Page | Welcome to Wikipedia,                      +|      | 2024-04-24 17:02:55.335643
         |           | the free encyclopedia that anyone can edit.+|      | 
         |           |  articles in English                        |      | 
(1 row)

indexid  | embedding
---------+----------------------------------------------------------------------------------------------
       0 | [0.006224656,0.010993197,-0.020153109,0.046154346,-0.0067520724,0.011807963,-0.014358369, ...
(1 row)

 ID |                                                        _metadata                                                        
----+-------------------------------------------------------------------------------------------------------------------------
 0  | {"id": "Main Page", "data": "Welcome to Wikipedia,\nthe free encyclopedia that anyone can edit.\n articles in English"}
(1 row)

    ID    |           _metadata            | Source | Target 
----------+--------------------------------+--------+--------
 __0__179 | {"weight": 0.7411725521087646} | 0      | 179
(1 row)

As we saw before, text content and vectors are stored in the database. Additionally, we now have graph node and edge data.

Let's take this up a level.

Vector search with Postgres

Now that the data is persisted in Postgres, can we run vector search without loading the txtai database locally? Yes!

# Query with a search string
query = str(list(embeddings.transform("Roman Empire")))
query = f"""
SELECT id, (embedding <#> '{query}') * -1 AS score, text FROM sections s \
JOIN vectors v ON s.indexid = v.indexid \
ORDER by score desc LIMIT 5
"""

!PGPASSWORD=pass psql -h localhost -U postgres -c "{query}"
             id              |       score        |                                                                                                                                                                                                                                                                                                            text                                                                                                                                                                                                                                                                                                            
-----------------------------+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Roman Empire                | 0.6798915266990662 | The Roman Empire was the post-Republican state of ancient Rome and is generally understood to mean the period and territory ruled by the Romans following Octavian's assumption of sole rule under the Principate in 31 BC. It included territory in Europe, North Africa, and Western Asia, and was ruled by emperors. The fall of the Western Roman Empire in 476 conventionally marks the end of classical antiquity and the beginning of the Middle Ages.
 History of the Roman Empire | 0.6295346617698669 | The history of the Roman Empire covers the history of ancient Rome from the fall of the Roman Republic in 27 BC until the abdication of Romulus Augustulus in AD 476 in the West, and the Fall of Constantinople in the East in AD 1453. Ancient Rome became a territorial empire while still a republic, but was then ruled by Roman emperors beginning with Augustus, becoming the Roman Empire following the death of the last republican dictator, the first emperor's adoptive father Julius Caesar.
 Roman Republic              | 0.6280723810195923 | The Roman Republic was the era of classical Roman civilization beginning with the overthrow of the Roman Kingdom (traditionally dated to 509 BC) and ending in 27 BC with the establishment of the Roman Empire. During this period, Rome's control expanded from the city's immediate surroundings to hegemony over the entire Mediterranean world.
 Latin Empire                |  0.613895058631897 | The Latin Empire, also referred to as the Latin Empire of Constantinople, was a feudal Crusader state founded by the leaders of the Fourth Crusade on lands captured from the Byzantine Empire. The Latin Empire was intended to replace the Byzantine Empire as the Western-recognized Roman Empire in the east, with a Catholic emperor enthroned in place of the Eastern Orthodox Roman emperors. The main objective of the Latin Empire was planned by Venice, which promoted the creation of this state for their self-benefit.
 Augustus                    | 0.6021512150764465 | Gaius Julius Caesar Augustus (born Gaius Octavius; 23 September 63 BC – 19 August AD 14), also known as Octavianus or Octavian, was the founder of the Roman Empire; he reigned as the first Roman emperor from 27 BC until his death in AD 14. The reign of Augustus initiated an imperial cult as well as an era associated with imperial peace, the Pax Romana or Pax Augusta, in which the Roman world was largely free of armed conflict aside from expansionary wars and the Year of the Four Emperors. The Principate system of imperial rule established by Augustus lasted until the Crisis of the Third Century.
(5 rows)

Just like that, vector search without a local txtai instance! The only part needed is a way to vectorize the search query. This can easily be replaced with an API vectorization service.

We can also find the most similar rows for an existing row.

# Find top n results closest to an existing row
query = """
SELECT id, text FROM sections s \
JOIN vectors v ON s.indexid = v.indexid \
WHERE v.indexid != 738 ORDER by v.embedding <#> (SELECT embedding FROM vectors WHERE indexid=738) LIMIT 5
"""

!PGPASSWORD=pass psql -h localhost -U postgres -c "{query}"
                id                |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 History of the Roman Empire      | The history of the Roman Empire covers the history of ancient Rome from the fall of the Roman Republic in 27 BC until the abdication of Romulus Augustulus in AD 476 in the West, and the Fall of Constantinople in the East in AD 1453. Ancient Rome became a territorial empire while still a republic, but was then ruled by Roman emperors beginning with Augustus, becoming the Roman Empire following the death of the last republican dictator, the first emperor's adoptive father Julius Caesar.
 Roman Republic                   | The Roman Republic was the era of classical Roman civilization beginning with the overthrow of the Roman Kingdom (traditionally dated to 509 BC) and ending in 27 BC with the establishment of the Roman Empire. During this period, Rome's control expanded from the city's immediate surroundings to hegemony over the entire Mediterranean world.
 Roman Kingdom                    | The Roman Kingdom, also referred to as the Roman monarchy or the regal period of ancient Rome, was the earliest period of Roman history when the city and its territory were ruled by kings. According to tradition, the Roman Kingdom began with the city's founding  753 BC, with settlements around the Palatine Hill along the river Tiber in central Italy, and ended with the overthrow of the kings and the establishment of the Republic  509 BC.
 Byzantine Empire                 | The Byzantine Empire, also referred to as the Eastern Roman Empire, was the continuation of the Roman Empire centered in Constantinople during Late Antiquity and the Middle Ages. The eastern half of the Empire survived the conditions that caused the fall of the West in the 5th century AD, and continued to exist until the fall of Constantinople to the Ottoman Empire in 1453. During most of its existence, the empire remained the most powerful economic, cultural, and military force in the Mediterranean world. The term "Byzantine Empire" was only coined following the empire's demise: its citizens referred to the polity as the "Roman Empire" and to themselves as "Romans". Due to the imperial seat's move from Rome to Byzantium, the adoption of state Christianity, and the predominance of Greek instead of Latin, modern historians continue to make a distinction between the earlier "Roman Empire" and the later "Byzantine Empire".
 Fall of the Western Roman Empire | The fall of the Western Roman Empire, also called the fall of the Roman Empire or the fall of Rome, was the loss of central political control in the Western Roman Empire, a process in which the Empire failed to enforce its rule, and its vast territory was divided into several successor polities. The Roman Empire lost the strengths that had allowed it to exercise effective control over its Western provinces; modern historians posit factors including the effectiveness and numbers of the army, the health and numbers of the Roman population, the strength of the economy, the competence of the emperors, the internal struggles for power, the religious changes of the period, and the efficiency of the civil administration. Increasing pressure from invading barbarians outside Roman culture also contributed greatly to the collapse. Climatic changes and both endemic and epidemic disease drove many of these immediate factors. The reasons for the collapse are major subjects of the historiography of the ancient world and they inform much modern discourse on state failure.
(5 rows)

Wrapping up

From prototyping to production, txtai is ready for all that can be thrown it's way. As always, this functionality is just the beginning and will continue to improve over time. But this is a big deal and has impacts for future services such as txtai.cloud ☁️

Stay tuned!