Extractive QA with txtai

Introduction to extractive question-answering with txtai

Parts 1 through 4 gave a general overview of txtai, the technology behind it, and examples of how to use it for similarity search. This article builds on that material and shows how to build extractive question-answering systems.

Install dependencies

Install txtai and all dependencies.

pip install txtai

Create Embeddings and Extractor instances

The Embeddings instance is the main entrypoint for txtai. An Embeddings instance defines the method used to tokenize and convert a segment of text into an embeddings vector.

The Extractor instance is the entrypoint for extractive question-answering.
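
As a rough illustration of why the Extractor is given an Embeddings instance: the embeddings are used to select the text most relevant to a query, and the question-answering model then extracts an answer span from that text. The sketch below mimics only the selection stage, and it uses plain token overlap as a stand-in for embeddings similarity. The helpers `tokens` and `select_context` are hypothetical names made up for this illustration; they are not txtai functions.

```python
import re

def tokens(text):
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return set(re.findall(r"\w+", text.lower()))

def select_context(query, texts, topn=3):
    """Rank texts by token overlap with the query and keep the best topn.

    Token overlap is only a placeholder for embeddings similarity.
    """
    scored = [(len(tokens(query) & tokens(text)), text) for text in texts]
    # sorted() is stable, so ties keep their original order
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [text for score, text in ranked if score > 0][:topn]

texts = ["Giants 5 Dodgers 4 final",
         "Blue Jays beat Red Sox final score 2-1",
         "Red Sox lost to the Blue Jays, 2-1",
         "Flyers win 4-1"]

# Only the two Red Sox / Blue Jays lines share tokens with the query
context = select_context("Red Sox - Blue Jays", texts)
print(context)
```

An embeddings model would also match lines that share no literal words with the query, which token overlap cannot do.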

Both the Embeddings and Extractor instances take a path to a transformer model. Any model on the Hugging Face model hub can be used in place of the models below.

from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

# Create extractor instance
extractor = Extractor(embeddings, "distilbert-base-cased-distilled-squad")

# Sample headlines to search for answers
data = ["Giants hit 3 HRs to down Dodgers",
        "Giants 5 Dodgers 4 final",
        "Dodgers drop Game 2 against the Giants, 5-4",
        "Blue Jays beat Red Sox final score 2-1",
        "Red Sox lost to the Blue Jays, 2-1",
        "Blue Jays at Red Sox is over. Score: 2-1",
        "Phillies win over the Braves, 5-0",
        "Phillies 5 Braves 0 final",
        "Final: Braves lose to the Phillies in the series opener, 5-0",
        "Lightning goaltender pulled, lose to Flyers 4-1",
        "Flyers 4 Lightning 1 final",
        "Flyers win 4-1"]

questions = ["What team won the game?", "What was score?"]

# Run every question against the context matched by query
def execute(query):
    return extractor([(question, query, question, False) for question in questions], data)

for query in ["Red Sox - Blue Jays", "Phillies - Braves", "Dodgers - Giants", "Flyers - Lightning"]:
    print("----", query, "----")
    for answer in execute(query):
        print(answer)
    print()

# Ad-hoc questions
question = "What hockey team won?"

print("----", question, "----")
print(extractor([(question, question, question, False)], data))

---- Red Sox - Blue Jays ----
('What team won the game?', 'Blue Jays')
('What was score?', '2-1')

---- Phillies - Braves ----
('What team won the game?', 'Phillies')
('What was score?', '5-0')

---- Dodgers - Giants ----
('What team won the game?', 'Giants')
('What was score?', '5-4')

---- Flyers - Lightning ----
('What team won the game?', 'Flyers')
('What was score?', '4-1')

---- What hockey team won? ----
[('What hockey team won?', 'Flyers')]
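
Each entry passed to the extractor above is a four-element tuple. Going by how the calls in this article use it, the fields appear to be a name that labels the answer in the results, a query used to select the relevant text, the question posed to the question-answering model, and a flag for returning a surrounding snippet instead of only the answer. The sketch below builds that list without calling txtai, so the structure can be inspected on its own. `build_queue` is a hypothetical helper, and the field names in the loop are an assumption rather than names taken from the library.

```python
def build_queue(questions, query, snippet=False):
    """Build (name, query, question, snippet) tuples, one per question.

    The question text doubles as the name, as in the article's calls.
    """
    return [(question, query, question, snippet) for question in questions]

questions = ["What team won the game?", "What was score?"]
queue = build_queue(questions, "Red Sox - Blue Jays")

# Field names below are assumed labels for the four positions
for name, query, question, snippet in queue:
    print(name, "|", query, "|", question, "|", snippet)
```

Because the question is reused as the name, each result tuple in the output above starts with the question text.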