Large Language Models (LLMs) dominated the AI and machine learning space in 2023. The results have been impressive and have captured the public imagination.

While LLMs are impressive, they are not problem-free. The biggest challenge is hallucinations. A hallucination is when an LLM generates output that is factually incorrect. The alarming part is that, at a cursory glance, the output sounds like good content. The default behavior of LLMs is to produce plausible-sounding answers even when no valid answer exists. LLMs are not great at saying "I don't know".

Retrieval augmented generation (RAG) helps reduce the risk of hallucinations by limiting the context in which an LLM can generate answers. This is typically done with a vector search query that hydrates a prompt with relevant context. RAG is one of the most practical and production-ready use cases for Generative AI. It's so popular now that some are building entire companies around it.

txtai has long had question-answering pipelines, which employ the same process of retrieving a relevant context. LLMs are now the preferred approach for analyzing that context and RAG pipelines are one of the main features of txtai. One of the other main features of txtai is that it's a vector database! You can build your prompts and limit your context all with one library. Hence the phrase all-in-one embeddings database.

This article shows how to build RAG pipelines with txtai.

Install dependencies

Install txtai and all dependencies. Since this article is using optional pipelines, we need to install the pipeline extras package.

# Install txtai
pip install txtai[pipeline] autoawq

# Get test data
wget -N https://github.com/neuml/txtai/releases/download/v6.2.0/tests.tar.gz
tar -xvzf tests.tar.gz

# Install NLTK
import nltk
nltk.download(['punkt', 'punkt_tab'])

Start with the basics

Let's jump right in and start with a simple LLM pipeline. The LLM pipeline supports local LLM models via Hugging Face Transformers and llama.cpp.

The LLM pipeline also supports API services (e.g. OpenAI, Claude, Bedrock) via LiteLLM. The LLM pipeline automatically detects the underlying LLM framework from the path parameter.

from txtai import LLM

# Create LLM
llm = LLM("TheBloke/Mistral-7B-OpenOrca-AWQ")

Next, we'll load a document to query. The Textractor pipeline has support for extracting text from common document formats (docx, pdf, xlsx).

from txtai.pipeline import Textractor

# Create Textractor
textractor = Textractor()
text = textractor("txtai/document.docx")
print(text)

txtai – the all-in-one embeddings database
txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

Summary of txtai features:
· Vector search with SQL, object storage, topic modeling
· Create embeddings for text, documents, audio, images and video
· Pipelines powered by language models that run LLM prompts
· Workflows to join pipelines together and aggregate business logic
· Build with Python or YAML. API bindings available for JavaScript, Java, Rust and Go.
· Run local or scale out with container orchestration

Examples
List of example notebooks.
|Notebook|Description|
|---|---|
|Introducing txtai |Overview of the functionality provided by txtai|
|Similarity search with images|Embed images and text into the same space for search|
|Build a QA database|Question matching with semantic search|
|Semantic Graphs|Explore topics, data connectivity and run network analysis|

Install
The easiest way to install is via pip and PyPI
pip install txtai
Python 3.9+ is supported. Using a Python virtual environment is recommended.
See the detailed install instructions for more information covering optional dependencies, environment specific prerequisites, installing from source, conda support and how to run with containers.

Model guide
The following shows a list of suggested models.
|Component|Model(s)|
|---|---|
|Embeddings|all-MiniLM-L6-v2|
||E5-base-v2|
|Image Captions|BLIP|
|Labels - Zero Shot|BART-Large-MNLI|
|Labels - Fixed|Fine-tune with training pipeline|
|Large Language Model (LLM)|Flan T5 XL|
||Mistral 7B OpenOrca|
|Summarization|DistilBART|
|Text-to-Speech|ESPnet JETS|
|Transcription|Whisper|
|Translation|OPUS Model Series|

Now we'll define a simple LLM pipeline. It takes a question and context (which in this case is the whole file), creates a prompt and runs it with the LLM.

def execute(question, text):
  prompt = f"""<|im_start|>system
  You are a friendly assistant. You answer questions from users.<|im_end|>
  <|im_start|>user
  Answer the following question using only the context below. Only include information specifically discussed.

  question: {question}
  context: {text} <|im_end|>
  <|im_start|>assistant
  """

  return llm(prompt, maxlength=4096, pad_token_id=32000)

execute("Tell me about txtai in one sentence", text)

Txtai is an all-in-one embeddings database for semantic search, LLM orchestration, and language model workflows, offering features such as vector search, pipeline creation, workflow management, and API bindings for various programming languages.

execute("What model does txtai recommend for transcription?", text)

The model that txtai recommends for transcription is Whisper.

execute("I don't know anything about txtai, what would be the best thing to read?", text)

The best thing to read to learn about txtai is the "Introducing txtai" notebook, which provides an overview of the functionality provided by txtai. This notebook covers various features such as vector search with SQL, object storage, topic modeling, creating embeddings for text, documents, audio, images, and video, and running language model workflows. Additionally, you can explore other example notebooks like "Similarity search with images," "Build a QA database," and "Semantic Graphs" to learn more about specific use cases and features. To install txtai, use pip and PyPI with Python 3.9+, and follow the detailed install instructions for more information on optional dependencies and environment-specific prerequisites.

If this is the first time you've seen Generative AI, then these statements are 🤯. Even if you've been in the space a while, it's still amazing how much a language model can understand and how high the quality of its answers is.

While this use case is fun, let's try to scale it to a larger set of documents.

Before continuing, it's important to note that txtai has multiple ways to run LLM inference. In the past, prior to "Chat Templates", the submitted text was expected to have all the required chat tokens embedded. The same prompt above can also be written with chat messages. This is especially important when working with LLM APIs (e.g. OpenAI, Claude, Bedrock).

llm([
    {"role": "system", "content": "You are a friendly assistant. You answer questions from users."},
    {"role": "user", "content": f"""
        Answer the following question using only the context below. Only include information specifically discussed.

        question: {question}
        context: {text} 
    """}
])
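To make the relationship between the two prompt styles concrete, here is a minimal sketch that renders a list of chat messages into the ChatML-style text used in the earlier prompt. The `chatml` helper is hypothetical and only for illustration. In practice, the model's chat template does this work and the exact tokens vary by model.

```python
def chatml(messages):
    """Render chat messages as ChatML-style prompt text (hypothetical helper)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]

    # A trailing assistant header cues the model to generate its reply
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a friendly assistant. You answer questions from users."},
    {"role": "user", "content": "Tell me about txtai in one sentence"},
]

print(chatml(messages))
```

The printed text has the same shape as the hand-written prompt, which is why chat messages are the more portable option: the template details stay with the model.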

Build a RAG pipeline with vector search

Let's say we have a large number of documents: hundreds, thousands or more. We can't just put all those documents into a single prompt. We'll run out of GPU memory fast!

This is where retrieval augmented generation enters the picture. We can use a query step that finds the best candidates to add to the prompt.

Typically, this candidate query uses vector search but it can be anything that runs a search and returns results. In fact, many complex production systems have customized retrieval pipelines that feed a context into LLM prompts.
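To illustrate that the retrieval step is pluggable, here is a small sketch that uses a plain keyword match instead of vector search. Everything in it (`tokens`, `keyword_search`, `build_context` and the sample rows) is made up for illustration and is not part of txtai. Any function that takes a question and returns ranked text could fill the same role.

```python
import re

def tokens(text):
    """Lowercase word set for a piece of text."""
    return set(re.findall(r"\w+", text.lower()))

def keyword_search(question, rows, limit=3):
    """Rank rows by the number of words shared with the question."""
    query = tokens(question)
    scored = [(len(query & tokens(row)), row) for row in rows]

    # Keep rows with at least one shared word, best match first
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [row for _, row in ranked[:limit]]

def build_context(question, search, rows):
    """Join the search results into a single context string for a prompt."""
    return "\n".join(search(question, rows))

rows = [
    "Image Captions BLIP",
    "Transcription Whisper",
    "Python 3.9+ is supported.",
]

print(build_context("Which model handles transcription?", keyword_search, rows))
```

Swapping `keyword_search` for a vector search changes how candidates are found, but the step that builds the context stays the same.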

The first step in building our RAG pipeline is creating the knowledge store. In this case, it's a vector database of file content. The files will be split into paragraphs with each paragraph stored as a separate row.

import os

from txtai import Embeddings

def stream(path):
  for f in sorted(os.listdir(path)):
    fpath = os.path.join(path, f)

    # Only accept documents
    if f.endswith(("docx", "xlsx", "pdf")):
      print(f"Indexing {fpath}")
      for paragraph in textractor(fpath):
        yield paragraph

# Document text extraction, split into paragraphs
textractor = Textractor(paragraphs=True)

# Vector Database
embeddings = Embeddings(content=True)
embeddings.index(stream("txtai"))

Indexing txtai/article.pdf
Indexing txtai/document.docx
Indexing txtai/document.pdf
Indexing txtai/spreadsheet.xlsx
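For intuition on what "one paragraph per row" means, here is a rough sketch that splits plain text on blank lines. This is an assumption for illustration only. The Textractor pipeline has its own extraction and segmentation logic, which may differ from this simple rule.

```python
import re

def paragraphs(text):
    """Split text on blank lines, normalize whitespace and drop empty chunks."""
    chunks = re.split(r"\n\s*\n", text)
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]

text = """txtai is an all-in-one embeddings database.

Python 3.9+ is supported.
Using a Python virtual environment is recommended.
"""

# Each paragraph would become a separate row in the vector database
for row in paragraphs(text):
    print(row)
```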

The next step is defining the RAG pipeline. This pipeline takes the input question, runs a vector search and builds a context using the search results. The context is then inserted into a prompt template and run with the LLM.

def context(question):
  context =  "\n".join(x["text"] for x in embeddings.search(question))
  return context

def rag(question):
  return execute(question, context(question))

rag("What model does txtai recommend for image captioning?")

Based on the provided context, txtai recommends the model "BLIP" for image captioning.

result = rag("When was the BLIP model added for image captioning?")
print(result)

The BLIP model was added for image captioning on 2022-03-17.

As we can see, the result is similar to what we had before without vector search. The difference is that we only used a relevant portion of the documents to generate the answer.

As we discussed before, this is important when dealing with large volumes of data. Not all of the data can be added to an LLM prompt. Additionally, having only the most relevant context helps the LLM generate higher quality answers.
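One simple way to keep a context within limits is to add ranked passages until a budget runs out. The sketch below assumes a hypothetical `fit_context` helper and counts whitespace-separated words as a rough stand-in for tokens. A real pipeline would count tokens with the model's tokenizer.

```python
def fit_context(passages, budget):
    """Keep the highest-ranked passages that fit within a word budget."""
    selected, used = [], 0
    for passage in passages:
        size = len(passage.split())

        # Stop at the first passage that would exceed the budget
        if used + size > budget:
            break

        selected.append(passage)
        used += size

    return "\n".join(selected)

# Passages are assumed to be sorted from most to least relevant
passages = ["Image Captions BLIP", "Transcription Whisper", "Translation OPUS Model Series"]
print(fit_context(passages, budget=5))
```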

Citations for LLMs

A healthy level of skepticism should be applied to answers generated by AI. We're far from the day where we can blindly trust answers from an AI model.

txtai has a couple of approaches for generating citations. The basic approach is to take the answer and search the vector database for the closest match.

for x in embeddings.search(result):
  print(x["text"])

E5-base-v2
Image Captions BLIP
Labels - Zero Shot BART-Large-MNLI
Model Guide
|Component |Model(s)|Date Added|
|---|---|---|
|Embeddings |all-MiniLM-L6-v2|2022-04-15|
|Image Captions |BLIP|2022-03-17|
|Labels - Zero Shot |BART-Large-MNLI|2022-01-01|
|Large Language Model (LLM) |Mistral 7B OpenOrca|2023-10-01|
|Summarization |DistilBART|2021-02-22|
|Text-to-Speech |ESPnet JETS|2022-08-01|
|Transcription |Whisper|2022-08-01|
|Translation |OPUS Model Series|2021-04-06|
&"Times New Roman,Regular"&12&A
Notebook Description
Introducing txtai Overview of the functionality provided by txtai
Similarity search with 
images Embed images and text into the same space for search

While the basic approach above works in this case, txtai has a more robust pipeline to handle citations and references.

The RAG pipeline is defined below. A RAG pipeline works in the same way as an LLM + Vector Search pipeline, except it has special logic for generating citations. This pipeline takes the answer and compares it to the context passed to the LLM to determine the most likely reference.

from txtai import RAG

# RAG prompt
def prompt(question):
  return [{
    "query": question,
    "question": f"""
Answer the following question using only the context below. Only include information specifically discussed.

question: {question}
context:
"""
}]

# Create LLM with system prompt template
llm = LLM("TheBloke/Mistral-7B-OpenOrca-AWQ", template="""<|im_start|>system
You are a friendly assistant. You answer questions from users.<|im_end|>
<|im_start|>user
{text} <|im_end|>
<|im_start|>assistant
""")

# Create RAG instance
rag = RAG(embeddings, llm, output="reference")

result = rag(prompt("What version of Python is supported?"), maxlength=4096, pad_token_id=32000)[0]
print("ANSWER:", result["answer"])
print("CITATION:", embeddings.search("select id, text from txtai where id = :id", limit=1, parameters={"id": result["reference"]}))

ANSWER: Python 3.9+ is supported. Using a Python virtual environment is recommended. The easiest way to install is via pip and PyPI.
CITATION: [{'id': '24', 'text': 'Python 3.9+ is supported. Using a Python virtual environment is recommended.'}]

As we can see, the RAG pipeline returns not only the answer to the question but also a citation. This step is crucial in any line of work where answers must be verified (which is most lines of work).
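The idea of matching an answer back to its source can be sketched in a few lines. The version below is only an illustration under stated assumptions: it scores each context row by word overlap (Jaccard similarity) with the answer, and the `reference` helper, row ids and row text are all made up. It is not how the txtai RAG pipeline is implemented.

```python
import re

def tokens(text):
    """Lowercase word set for a piece of text."""
    return set(re.findall(r"\w+", text.lower()))

def reference(answer, rows):
    """Return the id of the context row with the highest word overlap with the answer."""
    a = tokens(answer)

    def score(row):
        b = tokens(row["text"])
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    return max(rows, key=score)["id"]

# Hypothetical context rows, in the same id/text shape as the search results
rows = [
    {"id": "a", "text": "The easiest way to install is via pip and PyPI"},
    {"id": "b", "text": "Python 3.9+ is supported."},
]

print(reference("Python 3.9+ is supported", rows))
```

Comparing against only the rows that were passed to the LLM, rather than the whole database, keeps the citation tied to what the model actually saw.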

As with the LLM pipeline, the RAG pipeline also supports chat messages. See the RAG pipeline documentation for more.

Wrapping up

This article introduced retrieval augmented generation (RAG), explained why we need it and showed the options available for running RAG pipelines with txtai.

The advantages of building RAG pipelines with txtai are:

All-in-one database - one library can handle LLM inference and vector search retrieval
Generating citations - generating answers is useful but referencing where those answers came from is crucial in gaining the trust of users
Simple yet powerful - building pipelines can be done in a small amount of Python. Options are available to build pipelines in YAML and/or run through the API
