Speech to Speech RAG

Full cycle speech to speech workflow with RAG

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

There are many articles, notebooks and examples covering how to perform vector search and/or retrieval augmented generation (RAG) with txtai. A lesser-known part of txtai is its built-in workflow component.

Workflows are a simple yet powerful construct that takes a callable and returns elements. Workflows enable efficient processing of pipeline data. Workflows are streaming by nature and work on data in batches. This allows large volumes of data to be processed efficiently.
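
To make the batching idea concrete, here is a minimal sketch in plain Python. It is not txtai's implementation; the names `batches` and `run_workflow` are hypothetical and only illustrate how a chain of callables can process a stream of elements one batch at a time.

```python
from itertools import islice

def batches(elements, size):
    """Yield lists of up to `size` elements from any iterable."""
    iterator = iter(elements)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def run_workflow(tasks, elements, size=2):
    """Pass each batch through every task in order and yield results lazily."""
    for batch in batches(elements, size):
        for task in tasks:
            batch = task(batch)
        yield from batch

# Each task is a callable that takes a batch (list) and returns a batch
tasks = [
    lambda batch: [text.strip() for text in batch],
    lambda batch: [text.upper() for text in batch],
]

results = list(run_workflow(tasks, [" roman empire ", " vikings ", " txtai "]))
```

Because `run_workflow` is a generator, only one batch is held in memory at a time, which is the property that lets this style of processing scale to large inputs.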

This article will demonstrate how to build a Speech to Speech (S2S) workflow with txtai.

Note: This process is intended to run on local machines due to its use of input and output audio devices.

Install dependencies

Install txtai and all dependencies.

pip install txtai[pipeline-audio] autoawq

Define the S2S RAG Workflow

The next section defines the Speech to Speech (S2S) RAG workflow. The objective of this workflow is to respond to a user request in near real-time.

txtai supports workflow definitions in Python and with YAML. We'll cover both methods.

The S2S workflow below starts with a microphone pipeline, which streams and processes input audio. The microphone pipeline has voice activity detection (VAD) built-in. When speech is detected, the pipeline returns the captured audio data. Next, the speech is transcribed to text and then passed to a RAG pipeline prompt. Finally, the RAG result is run through a text to speech (TTS) pipeline and streamed to an output audio device.

import logging

from txtai import Embeddings, RAG
from txtai.pipeline import AudioStream, Microphone, TextToSpeech, Transcription
from txtai.workflow import Workflow, StreamTask, Task

# Enable DEBUG logging
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)

# Microphone
microphone = Microphone()

# Transcription
transcribe = Transcription("distil-whisper/distil-large-v3")

# Embeddings database
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

# Define prompt template
template = """
Answer the following question using only the context below. Only include information
specifically discussed. Answer the question without explaining how you found the answer.

question: {question}
context: {context}"""

# Create RAG pipeline
rag = RAG(
    embeddings,
    "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    system="You are a friendly assistant. You answer questions from users.",
    template=template,
    context=10
)

# Text to speech
tts = TextToSpeech("neuml/vctk-vits-onnx")

# Audio stream
audiostream = AudioStream()

# Define speech to speech workflow
workflow = Workflow(tasks=[
    Task(action=microphone),
    Task(action=transcribe, unpack=False),
    StreamTask(action=lambda x: rag(x, maxlength=4096, stream=True), batch=True),
    StreamTask(action=lambda x: tts(x, stream=True, speaker=15), batch=True),
    StreamTask(action=audiostream, batch=True)
])

while True:
    print("Waiting for input...")
    list(workflow([None]))
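
The `stream=True` arguments mean each stage hands its output to the next as it is produced instead of waiting for the full response. As a rough illustration of why this lowers latency, the sketch below groups a stream of generated tokens into sentence-sized chunks that a speech stage could start synthesizing right away. This is plain Python and an assumption about the general approach, not txtai's internal logic; `sentences` and the sample tokens are hypothetical.

```python
def sentences(tokens, terminators=(".", "!", "?")):
    """Group streamed tokens into sentences, yielding each one as soon as it ends."""
    buffer = []
    for token in tokens:
        buffer.append(token)
        if token.rstrip().endswith(terminators):
            yield "".join(buffer).strip()
            buffer = []
    # Flush any trailing text that did not end with a terminator
    if buffer:
        yield "".join(buffer).strip()

# Simulated token stream standing in for language model output
stream = iter(["Rome", " was", " vast", ".", " It", " fell", " in", " 476", ".", " The", " end"])
chunks = list(sentences(stream))
```

With this kind of chunking, the first sentence can be spoken while later tokens are still being generated.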

Given that the inputs and outputs are audio, you'll have to use your imagination if you're reading this as an article.

Check out this video to see the workflow in action! The following examples are run:

  • Tell me about the Roman Empire

  • Explain how faster than light travel could work

  • Write a short poem about the Vikings

  • Tell me about the Roman Empire in French

S2S Workflow in YAML

A crucial feature of txtai workflows is that they can be defined with YAML. This enables building workflows in a low-code and/or no-code setting. These YAML workflows can then be "dockerized" and run.

Let's define the same workflow below.

# Microphone
microphone:

# Transcription
transcription:
  path: distil-whisper/distil-large-v3

# Embeddings database
cloud:
  provider: huggingface-hub
  container: neuml/txtai-wikipedia

embeddings:

# RAG
rag:
  path: "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"
  system: You are a friendly assistant. You answer questions from users.
  template: |
    Answer the following question using only the context below. Only include information
    specifically discussed. Answer the question without explaining how you found the answer.

    question: {question}
    context: {context}
  context: 10

# TTS
texttospeech:
  path: neuml/vctk-vits-onnx

# AudioStream
audiostream:

# Speech to Speech Chat workflow
workflow:
  s2s:
    tasks:
      - microphone
      - action: transcription
        unpack: False
      - task: stream
        action: rag
        args:
          maxlength: 4096
          stream: True
        batch: True
      - task: stream
        action: texttospeech
        args:
          stream: True
          speaker: 15
        batch: True
      - task: stream
        action: audiostream
        batch: True

With the YAML above saved as s2s.yml, the workflow can be loaded and run as a txtai application.

from txtai import Application

app = Application("s2s.yml")
while True:
    print("Waiting for input...")
    list(app.workflow("s2s", [None]))

Once again, the same idea, just a different way to do it. In the video demo, the following queries were asked.

  • As a Patriots fan, who would you guess my favorite quarterback of all time is?

  • I'm tall and run fast, what do you think the best soccer position for me is?

  • I run slow, what do you think the best soccer position for me is?

With YAML workflows, it's possible to fully define the process outside of code such as with a web interface. Perhaps someday we'll see this with txtai.cloud 😀
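
The idea of defining a process outside of code can be sketched with a plain dictionary standing in for parsed YAML. The snippet below is an illustration only, not how txtai loads its configuration; `REGISTRY`, `build` and the toy actions are hypothetical.

```python
# Hypothetical registry mapping action names to callables
REGISTRY = {
    "lowercase": lambda text, **kwargs: text.lower(),
    "truncate": lambda text, length=10, **kwargs: text[:length],
}

def build(config):
    """Turn a list of task definitions into a single callable."""
    steps = []
    for task in config["tasks"]:
        # A task is either a bare action name or a dict with an action and args
        if isinstance(task, str):
            task = {"action": task}
        steps.append((REGISTRY[task["action"]], task.get("args", {})))

    def run(text):
        # Apply each configured action in order, passing its args as keywords
        for action, args in steps:
            text = action(text, **args)
        return text

    return run

# Stand-in for a parsed YAML document
config = {"tasks": ["lowercase", {"action": "truncate", "args": {"length": 5}}]}
workflow = build(config)
result = workflow("Roman Empire")
```

As in the YAML above, a task here can be a bare name or an entry with an `action` and `args`, so changing the process means editing data rather than code.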

Wrapping up

This article demonstrated how to build a Speech to Speech (S2S) workflow with txtai. While the workflow uses an off-the-shelf embeddings database, a custom embeddings database can easily be swapped in. From there, we have S2S with our own data!
