External vectorization

Vectorization with precomputed embeddings datasets and APIs


txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

Vectorization is the process of transforming data into numbers using machine learning models. Input data is run through a model and fixed dimension vectors are returned. These vectors can then be loaded into a vector database for similarity search.
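
As a rough illustration of the idea, the sketch below maps text to fixed-dimension vectors and ranks them by cosine similarity. The `toy_vectorize` function is a hypothetical stand-in that counts vocabulary words. It is not a machine learning model and not part of txtai; it only shows the shape of the data going in and out.

```python
import numpy as np

def toy_vectorize(texts, vocabulary):
    """Hypothetical stand-in for a model: count vocabulary words in each text."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = np.zeros((len(texts), len(vocabulary)), dtype=np.float32)
    for row, text in enumerate(texts):
        for word in text.lower().split():
            if word in index:
                vectors[row, index[word]] += 1.0

    # Normalize rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)

documents = ["linux operating system kernel", "lottery ticket winner", "bear attack warning"]
vocabulary = sorted({word for text in documents for word in text.split()})

# Every input maps to a vector with the same number of dimensions
vectors = toy_vectorize(documents, vocabulary)
print(vectors.shape)

# Similarity search: score the query against each stored vector
query = toy_vectorize(["operating system"], vocabulary)[0]
scores = vectors @ query
best = int(np.argmax(scores))
print(documents[best])
```

A real vector model replaces the word counts with learned representations, but the contract is the same: a list of inputs goes in and a two-dimensional array with one row per input comes out.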

txtai is an open-source first system. Given its own open-source roots, like-minded projects such as sentence-transformers are prioritized during development. But that doesn't mean txtai can't work with Embeddings API services.

This article shows how to use txtai with external vectorization.

Install dependencies

Install txtai and all dependencies.

# Install txtai
pip install txtai

Create an Embeddings dataset

The first thing we'll do is pre-compute an embeddings dataset. In addition to Embeddings APIs, this can also be used during internal testing to tune index and database settings.

from txtai import Embeddings

# Load dataset
wikipedia = Embeddings()
wikipedia.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

# Query for Top 10,000 most popular articles
query = """
SELECT id, text FROM txtai
order by percentile desc
LIMIT 10000
"""

data = wikipedia.search(query)

# Encode vectors using same vector model as Wikipedia
vectors = wikipedia.batchtransform(x["text"] for x in data)

# Build dataset of id, text, embeddings
dataset = []
for i, row in enumerate(data):
  dataset.append({"id": row["id"], "article": row["text"], "embeddings": vectors[i]})

Build an Embeddings index with external vectors

Next, we'll create an Embeddings index with an external transform function set.

The external transform function can be any function or callable object. This function takes an array of data and returns an array of embeddings.

def transform(inputs):
  return wikipedia.batchtransform(inputs)

def stream():
  for row in dataset:
    # Index vector
    yield row["id"], row["embeddings"]

    # Index metadata
    yield {"id": row["id"], "article": row["article"]}

embeddings = Embeddings(transform="__main__.transform", content=True)
embeddings.index(stream())

🚀 Notice how fast creating the index was compared to indexing with a vector model. This is because there is no vectorization at index time; the vectors were already computed. Now let's run a query.

embeddings.search("select id, article, score from txtai where similar(:x)", parameters={"x": "operating system"})
[{'id': 'Operating system',
  'article': 'An operating system (OS) is system software that manages computer hardware and software resources, and provides common services for computer programs.',
  'score': 0.8955847024917603},
 {'id': 'MacOS',
  'article': "macOS (;), originally Mac\xa0OS\xa0X, previously shortened as OS\xa0X, is an operating system developed and marketed by Apple Inc. since 2001. It is the primary operating system for Apple's Mac computers. Within the market of desktop and laptop computers, it is the second most widely used desktop OS, after Microsoft Windows and ahead of all Linux distributions, including ChromeOS.",
  'score': 0.8666583299636841},
 {'id': 'Linux',
  'article': 'Linux is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically packaged as a Linux distribution (distro), which includes the kernel and supporting system software and libraries, many of which are provided by the GNU Project. Many Linux distributions use the word "Linux" in their name, but the Free Software Foundation uses and recommends the name "GNU/Linux" to emphasize the use and importance of GNU software in many distributions, causing some controversy.',
  'score': 0.839817225933075}]

All as expected! This method can also be used with existing datasets on the Hugging Face Hub.
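
The transform used above is a plain function, but since any callable works, a transform can also be an object that carries its own state. The sketch below uses a hypothetical `LookupTransform` class that returns precomputed vectors from an in-memory table. The class name, the table and the zero-vector fallback for unknown inputs are illustrative assumptions, not part of txtai.

```python
import numpy as np

class LookupTransform:
    """Hypothetical callable transform backed by a table of precomputed vectors."""

    def __init__(self, table, dimensions):
        self.table = table
        self.dimensions = dimensions

    def __call__(self, inputs):
        # Takes an array of data and returns one embedding row per input.
        # Unknown inputs fall back to a zero vector (an assumption for this sketch).
        rows = [self.table.get(x, np.zeros(self.dimensions)) for x in inputs]
        return np.array(rows, dtype=np.float32)

table = {
    "operating system": np.array([1.0, 0.0, 0.0]),
    "lottery": np.array([0.0, 1.0, 0.0]),
}

lookup = LookupTransform(table, dimensions=3)
result = lookup(["lottery", "unknown text"])
print(result.shape)
```

An instance like `lookup` should be usable wherever this article passes a function object as the `transform` setting, since it honors the same contract: a list of inputs in, an array of embeddings out.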

Integrate with Embeddings API services

Next, we'll integrate with an Embeddings API service to build vectors.

The code below interfaces with the Hugging Face Inference API. This can easily be switched to OpenAI, Cohere or even your own local API.

import numpy as np
import requests

BASE = "https://api-inference.huggingface.co/pipeline/feature-extraction"

def transform(inputs):
  # Your API provider of choice
  response = requests.post(f"{BASE}/sentence-transformers/nli-mpnet-base-v2", json={"inputs": inputs})
  return np.array(response.json(), dtype=np.float32)

data = [
  "US tops 5 million confirmed virus cases",
  "Canada's last fully intact ice shelf has suddenly collapsed, " +
  "forming a Manhattan-sized iceberg",
  "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
  "The National Park Service warns against sacrificing slower friends " +
  "in a bear attack",
  "Maine man wins $1M from $25 lottery ticket",
  "Make huge profits without work, earn up to $100,000 a day"
]

embeddings = Embeddings({"transform": transform, "backend": "numpy", "content": True})
embeddings.index(data)
embeddings.search("feel good story", 1)
[{'id': '4',
  'text': 'Maine man wins $1M from $25 lottery ticket',
  'score': 0.08329013735055923}]

This is the classic txtai tutorial example. Except this time, vectorization is run with an external API service!
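
API services often limit request sizes and can return partial or malformed responses, so it may help to wrap the call before handing it to txtai. The following is a minimal sketch under stated assumptions: `batched_transform` and `call_api` are hypothetical names, `call_api` is any function that takes a list of strings and returns a list of vectors (for example, a thin wrapper around the `requests.post` call shown earlier), and `fake_api` stands in for the network call so the example runs offline.

```python
import numpy as np

def batched_transform(call_api, batch_size=32):
    """Wrap an API call so large inputs are sent in smaller batches."""
    def transform(inputs):
        inputs = list(inputs)
        results = []
        for start in range(0, len(inputs), batch_size):
            batch = inputs[start:start + batch_size]
            vectors = np.asarray(call_api(batch), dtype=np.float32)

            # Guard against partial or malformed responses
            if vectors.ndim != 2 or vectors.shape[0] != len(batch):
                raise ValueError(f"Expected {len(batch)} vectors, got shape {vectors.shape}")

            results.append(vectors)

        return np.concatenate(results) if results else np.empty((0, 0), dtype=np.float32)

    return transform

# Fake API for illustration: returns a 2-dimensional vector per input
def fake_api(batch):
    return [[float(len(text)), 1.0] for text in batch]

transform = batched_transform(fake_api, batch_size=2)
vectors = transform(["a", "bb", "ccc"])
print(vectors.shape)
```

The batch size here is arbitrary. A real service would dictate its own limits, and a production wrapper would likely also add timeouts and retries.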

Wrapping up

This article showed how txtai can integrate with external vectorization. This can be a dataset with pre-computed embeddings and/or an Embeddings API service.

Each of txtai's components can be fully customized and vectorization is no exception. Flexibility and customization for the win!