txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.
There are many articles, notebooks and examples covering how to perform vector search and/or retrieval augmented generation (RAG) with txtai. A lesser-known part of txtai is its built-in workflow component.
Workflows are a simple yet powerful construct that takes a callable and returns elements. Workflows enable efficient processing of pipeline data. Workflows are streaming by nature and work on data in batches. This allows large volumes of data to be processed efficiently.
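To make the batching idea concrete, here is a minimal plain-Python sketch of elements streaming through a list of callables one batch at a time. It is a conceptual illustration only, not txtai's implementation; `batches`, `run_workflow` and the two lambda tasks are hypothetical names made up for this example.

```python
from itertools import islice


def batches(elements, size):
    """Yield successive lists of at most `size` elements from any iterable."""
    iterator = iter(elements)
    while batch := list(islice(iterator, size)):
        yield batch


def run_workflow(tasks, elements, batch_size=2):
    """Stream elements through each task, one batch at a time."""
    for batch in batches(elements, batch_size):
        for task in tasks:
            # Each task is a callable that takes a list and returns a list
            batch = task(batch)
        yield from batch


# Hypothetical tasks: strip whitespace, then uppercase
tasks = [
    lambda batch: [text.strip() for text in batch],
    lambda batch: [text.upper() for text in batch],
]

results = list(run_workflow(tasks, [" a ", " b ", " c "]))
print(results)
```

Because `run_workflow` is a generator, results become available as soon as the first batch finishes, and the full input never has to be held in memory at once.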
This article will demonstrate how to build a Speech to Speech (S2S) workflow with txtai.
Note: This process is intended to run on local machines due to its use of input and output audio devices.
Install dependencies
Install txtai and all dependencies.
pip install txtai[pipeline-audio] autoawq
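After installing, a quick check like the sketch below can confirm the packages are importable before any models are downloaded. The import names are assumptions: `sounddevice` is assumed to be the audio device library pulled in by the audio extras, and `awq` is assumed to be the module name installed by autoawq. `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Assumed import names: txtai, sounddevice (audio devices) and awq (autoawq)
required = ["txtai", "sounddevice", "awq"]
missing = missing_modules(required)
print("Missing modules:", missing if missing else "none")
```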
Define the S2S RAG Workflow
The next section defines the Speech to Speech (S2S) RAG workflow. The objective of this workflow is to respond to a user request in near real-time.
txtai supports workflow definitions in Python and with YAML. We'll cover both methods.
The S2S workflow below starts with a microphone pipeline, which streams and processes input audio. The microphone pipeline has voice activity detection (VAD) built-in. When speech is detected, the pipeline returns the captured audio data. Next, the speech is transcribed to text and then passed to a RAG pipeline prompt. Finally, the RAG result is run through a text to speech (TTS) pipeline and streamed to an output audio device.
import logging
from txtai import Embeddings, RAG
from txtai.pipeline import AudioStream, Microphone, TextToSpeech, Transcription
from txtai.workflow import Workflow, StreamTask, Task
# Enable DEBUG logging
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
# Microphone
microphone = Microphone()
# Transcription
transcribe = Transcription("distil-whisper/distil-large-v3")
# Embeddings database
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")
# Define prompt template
template = """
Answer the following question using only the context below. Only include information
specifically discussed. Answer the question without explaining how you found the answer.
question: {question}
context: {context}"""
# Create RAG pipeline
rag = RAG(
    embeddings,
    "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    system="You are a friendly assistant. You answer questions from users.",
    template=template,
    context=10
)
# Text to speech
tts = TextToSpeech("neuml/vctk-vits-onnx")
# Audio stream
audiostream = AudioStream()
# Define speech to speech workflow
workflow = Workflow(tasks=[
    Task(action=microphone),
    Task(action=transcribe, unpack=False),
    StreamTask(action=lambda x: rag(x, maxlength=4096, stream=True), batch=True),
    StreamTask(action=lambda x: tts(x, stream=True, speaker=15), batch=True),
    StreamTask(action=audiostream, batch=True)
])
while True:
    print("Waiting for input...")
    list(workflow([None]))
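For readers without a microphone handy, the data flow can be sketched with plain-Python stand-ins. Everything here is hypothetical: `fake_transcribe`, `fake_rag`, `sentences` and `fake_tts` only mimic the shape of the real stages, and grouping tokens into sentences is one assumed way a streaming pipeline could start speaking before the full answer is generated, not a description of txtai's internals.

```python
def fake_transcribe(audio):
    """Stand-in for transcription: treat the 'audio' as already being text."""
    return audio


def fake_rag(question):
    """Stand-in for a streaming LLM: yield the answer one token at a time."""
    answer = f"You asked: {question} Here is a short answer. Thanks for asking!"
    for token in answer.split(" "):
        yield token + " "


def sentences(tokens):
    """Group streamed tokens into sentences so speech can start early."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "?", "!")):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing text that did not end with punctuation
    if buffer.strip():
        yield buffer.strip()


def fake_tts(chunks):
    """Stand-in for text to speech: tag each sentence instead of synthesizing."""
    for chunk in chunks:
        yield f"<audio:{chunk}>"


question = fake_transcribe("What is txtai?")
played = list(fake_tts(sentences(fake_rag(question))))
print(played)
```

Each stage is a generator, so the first "audio" chunk is produced as soon as the first sentence is complete rather than after the whole answer.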
Given that the input and outputs are audio, you'll have to use your imagination if you're reading this as an article.
Check out this video to see the workflow in action! The following examples are run:
Tell me about the Roman Empire
Explain how faster than light travel could work
Write a short poem about the Vikings
Tell me about the Roman Empire in French
S2S Workflow in YAML
A crucial feature of txtai workflows is that they can be defined with YAML. This enables building workflows in a low-code and/or no-code setting. These YAML workflows can then be "dockerized" and run.
Let's define the same workflow below.
# Microphone
microphone:
# Transcription
transcription:
  path: distil-whisper/distil-large-v3
# Embeddings database
cloud:
  provider: huggingface-hub
  container: neuml/txtai-wikipedia
embeddings:
# RAG
rag:
  path: "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"
  system: You are a friendly assistant. You answer questions from users.
  template: |
    Answer the following question using only the context below. Only include information
    specifically discussed. Answer the question without explaining how you found the answer.
    question: {question}
    context: {context}
  context: 10
# TTS
texttospeech:
  path: neuml/vctk-vits-onnx
# AudioStream
audiostream:
# Speech to Speech Chat workflow
workflow:
  s2s:
    tasks:
      - microphone
      - action: transcription
        unpack: False
      - task: stream
        action: rag
        args:
          maxlength: 4096
          stream: True
        batch: True
      - task: stream
        action: texttospeech
        args:
          stream: True
          speaker: 15
        batch: True
      - task: stream
        action: audiostream
        batch: True
from txtai import Application
app = Application("s2s.yml")
while True:
    print("Waiting for input...")
    list(app.workflow("s2s", [None]))
Once again, the same idea, just a different way to do it. In the video demo, the following queries were asked.
As a Patriots fan, who would you guess my favorite quarterback of all time is?
I'm tall and run fast, what do you think the best soccer position for me is?
I run slow, what do you think the best soccer position for me is?
With YAML workflows, it's possible to fully define the process outside of code such as with a web interface. Perhaps someday we'll see this with txtai.cloud.
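The reason a declarative definition can live outside of code is that task names are just data that get resolved to callables at load time. The sketch below illustrates that idea in plain Python. It is not how txtai parses its configuration: the `actions` registry, the `config` dictionary and the `build` and `run` helpers are all hypothetical.

```python
# Registry of named actions, playing the role of the pipelines declared in YAML
actions = {
    "lowercase": lambda batch: [text.lower() for text in batch],
    "exclaim": lambda batch, suffix="!": [text + suffix for text in batch],
}

# Plain data, as it might look after parsing a YAML file
config = {
    "tasks": [
        "lowercase",
        {"action": "exclaim", "args": {"suffix": "?!"}},
    ]
}


def build(config, actions):
    """Turn a declarative task list into a list of callables."""
    tasks = []
    for entry in config["tasks"]:
        # A bare string is shorthand for a task with only an action name
        entry = {"action": entry} if isinstance(entry, str) else entry
        action, args = actions[entry["action"]], entry.get("args", {})
        # Bind loop variables as defaults to avoid late-binding surprises
        tasks.append(lambda batch, action=action, args=args: action(batch, **args))
    return tasks


def run(tasks, batch):
    """Apply each task to the batch in order."""
    for task in tasks:
        batch = task(batch)
    return batch


result = run(build(config, actions), ["Hello", "World"])
print(result)
```

Since `config` is plain data, it could just as well come from a file, a form in a web interface or an API request.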
Wrapping up
This article demonstrated how to build a Speech to Speech (S2S) workflow with txtai. While the workflow uses an off-the-shelf embeddings database, a custom embeddings database can easily be swapped in. From there, we have S2S with our own data!
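As a rough sketch of that swap, the snippet below prepares a handful of documents and indexes them into a new embeddings database. The sample documents, the `to_rows` and `build_embeddings` helpers and the choice of `sentence-transformers/all-MiniLM-L6-v2` as the vector model are assumptions for illustration. The txtai import is done lazily and the final calls are left commented out, since running them downloads a model.

```python
def to_rows(texts):
    """Convert plain strings into (id, text, tags) rows for indexing."""
    return [(uid, text, None) for uid, text in enumerate(texts)]


def build_embeddings(texts, path="sentence-transformers/all-MiniLM-L6-v2"):
    """Index `texts` into a new embeddings database with content storage."""
    # Imported lazily so the helper above works without txtai installed
    from txtai import Embeddings

    # content=True stores the text so RAG can pass it to the LLM as context
    embeddings = Embeddings(path=path, content=True)
    embeddings.index(to_rows(texts))
    return embeddings


# Hypothetical documents standing in for your own data
documents = [
    "Our office is open Monday through Friday.",
    "Support requests are answered within one business day.",
]
rows = to_rows(documents)
print(rows[0])

# embeddings = build_embeddings(documents)
# The result can then be passed to RAG in place of the Wikipedia index
```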