Extractive QA to build structured data

Build structured datasets using extractive question-answering

Traditional ETL/data parsing systems establish rules to extract information of interest. Regular expressions, string parsing and similar methods define fixed rules. This works in many cases, but what if you're working with unstructured data that contains numerous variations? The rules become cumbersome and hard to maintain over time.
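
For example, a rule-based extractor for the fields used later in this notebook might look like the sketch below. The patterns are illustrative only; every new phrasing ("the 11th of August", "three dollars") needs yet another rule.

import re

# Fixed extraction rules - each new variation of the data needs another pattern
URL = re.compile(r"https?://\S+")
DATE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")
AMOUNT = re.compile(r"\$\d+(?:\.\d{2})?")

def extract(text):
    # Return the first match for each field, or None when no rule fires
    fields = {"url": URL, "date": DATE, "amount": AMOUNT}
    return {name: (m.group() if (m := pattern.search(text)) else None)
            for name, pattern in fields.items()}

print(extract("Released on 6/03/2021"))
# {'url': None, 'date': '6/03/2021', 'amount': None}

# These rules miss "the 11th of August" and "three dollars" - more patterns would be needed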

This notebook uses machine learning and extractive question-answering (QA) to utilize the vast knowledge built into large language models. These models have been trained on extremely large datasets, learning the many variations of natural language.

Install dependencies

Install txtai and all dependencies.

pip install txtai[pipeline-train]

Train a QA model with few-shot learning

The code below trains a new QA model using a handful of examples. These examples give the model hints on the types of questions that will be asked and the types of answers to look for. As shown below, it doesn't take many examples to do this.

import pandas as pd
from txtai.pipeline import HFTrainer, Questions, Labels

# Training data for few-shot learning
data = [
    {"question": "What is the url?",
     "context": "Faiss (https://github.com/facebookresearch/faiss) is a library for efficient similarity search.",
     "answers": "https://github.com/facebookresearch/faiss"},
    {"question": "What is the url", "context": "The last release was Wed Sept 25 2021", "answers": None},
    {"question": "What is the date?", "context": "The last release was Wed Sept 25 2021", "answers": "Wed Sept 25 2021"},
    {"question": "What is the date?", "context": "The order total comes to $44.33", "answers": None},
    {"question": "What is the amount?", "context": "The order total comes to $44.33", "answers": "$44.33"},
    {"question": "What is the amount?", "context": "The last release was Wed Sept 25 2021", "answers": None},
]

# Fine-tune QA model
trainer = HFTrainer()
model, tokenizer = trainer("distilbert-base-cased-distilled-squad", data, task="question-answering")
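
As a quick, optional sanity check, the fine-tuned model can be passed directly to a Questions pipeline (the same pipeline used in the next section) and run on a single made-up example.

# Optional check: run the fine-tuned model on a single example
qa = Questions(path=(model, tokenizer), gpu=True)
print(qa(["What is the date?"], ["The project kicked off on Jan 5 2022"]))
# Should extract the date span, e.g. ['Jan 5 2022']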

Parse data into a structured table

The next section takes a series of rows of text and runs a set of questions against each row. The answers are then used to build a pandas DataFrame.

# Input data
context = ["Released on 6/03/2021",
           "Release delayed until the 11th of August",
           "Documentation can be found here: neuml.github.io/txtai",
           "The stock price fell to three dollars",
           "Great day: closing price for March 23rd is $33.11, for details - https://finance.google.com"]

# Define column queries
queries = ["What is the url?", "What is the date?", "What is the amount?"]

# Extract fields
questions = Questions(path=(model, tokenizer), gpu=True)
results = [questions([question] * len(context), context) for question in queries]
results.append(context)

# Load into DataFrame
pd.DataFrame(list(zip(*results)), columns=["URL", "Date", "Amount", "Text"])
    URL                    Date            Amount         Text
0   None                   6/03/2021       None           Released on 6/03/2021
1   None                   11th of August  None           Release delayed until the 11th of August
2   neuml.github.io/txtai  None            None           Documentation can be found here: neuml.github...
3   None                   None            three dollars  The stock price fell to three dollars
4   finance.google.com     March 23rd      $33.11         Great day: closing price for March 23rd is $33...
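
Since the result is a standard pandas DataFrame, the usual pandas operations apply. A small example follows; the df variable and file name are just for illustration.

# Keep a reference for further processing
df = pd.DataFrame(list(zip(*results)), columns=["URL", "Date", "Amount", "Text"])

# Rows where a URL was found
print(df[df["URL"].notna()])

# Persist the structured table
df.to_csv("extracted.csv", index=False)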

Add additional columns

This method can be combined with other models to categorize, group or otherwise derive additional columns. The code below derives an additional sentiment column.

# Add sentiment
labels = Labels(path="distilbert-base-uncased-finetuned-sst-2-english", dynamic=False)
sentiments = ["POSITIVE" if x[0][0] == 1 else "NEGATIVE" for x in labels(context)]
results.insert(len(results) - 1, sentiments)

# Load into DataFrame
pd.DataFrame(list(zip(*results)), columns=["URL", "Date", "Amount", "Sentiment", "Text"])
    URL                    Date            Amount         Sentiment  Text
0   None                   6/03/2021       None           POSITIVE   Released on 6/03/2021
1   None                   11th of August  None           NEGATIVE   Release delayed until the 11th of August
2   neuml.github.io/txtai  None            None           NEGATIVE   Documentation can be found here: neuml.github...
3   None                   None            three dollars  NEGATIVE   The stock price fell to three dollars
4   finance.google.com     March 23rd      $33.11         POSITIVE   Great day: closing price for March 23rd is $33...
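
The same approach extends to other derived columns. Below is a hedged sketch that adds a zero-shot topic column with a dynamic Labels pipeline; the facebook/bart-large-mnli model and the label set are assumptions for illustration, not part of the original workflow.

# Zero-shot topic labels (illustrative model and label set)
topics = Labels(path="facebook/bart-large-mnli", dynamic=True)
tags = ["release", "documentation", "finance"]

# Pick the highest scoring label for each row
topic = [tags[x[0][0]] for x in topics(context, tags)]

# Insert before the Text column and rebuild the DataFrame
results.insert(len(results) - 1, topic)
pd.DataFrame(list(zip(*results)), columns=["URL", "Date", "Amount", "Sentiment", "Topic", "Text"])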