Extractive QA with Elasticsearch

Run extractive question-answering queries with Elasticsearch

txtai is datastore agnostic; the library analyzes sets of text regardless of where they are stored. The following example shows how extractive question-answering can be added on top of an Elasticsearch system.

Install dependencies

Install txtai and Elasticsearch.

# Install txtai and the Elasticsearch Python client (pinned to match the 7.10.1 server below)
pip install txtai elasticsearch==7.10.1

# Download and extract elasticsearch
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.10.1

Start an instance of Elasticsearch.

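import os
import time
from subprocess import Popen, PIPE, STDOUT

# Start the server as uid 1 (the daemon user on Debian-based systems); Elasticsearch refuses to run as root
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))

# Give the server time to start
time.sleep(30)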
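
A fixed 30-second wait can be too short on a slow machine and wasteful on a fast one. The following is a minimal sketch of polling instead, assuming the server listens on http://localhost:9200 (the address the client connects to later in this example). `server_ready` and `wait_until` are hypothetical helper names, not part of Elasticsearch or txtai.

```python
import time
import urllib.error
import urllib.request

def server_ready(url="http://localhost:9200"):
  """Return True if an HTTP GET to url succeeds with status 200."""
  try:
    with urllib.request.urlopen(url, timeout=5) as response:
      return response.status == 200
  except (urllib.error.URLError, OSError):
    # Connection refused, timeout or HTTP error: not ready yet
    return False

def wait_until(probe, timeout=120, interval=2):
  """Call probe() every interval seconds until it returns True or timeout elapses."""
  deadline = time.monotonic() + timeout
  while time.monotonic() < deadline:
    if probe():
      return True
    time.sleep(interval)
  return False

# Usage (assumes the server process has already been started):
# if not wait_until(server_ready):
#   raise RuntimeError("Elasticsearch did not start in time")
```

Passing the probe as a function keeps the waiting logic independent of how readiness is checked.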

Download data

This example works with a subset of the CORD-19 dataset. The COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, covering COVID-19 and the coronavirus family of viruses.

The following download is a SQLite database generated from a Kaggle notebook. More information on this data format can be found in the CORD-19 Analysis notebook.

wget https://github.com/neuml/txtai/releases/download/v1.1.0/tests.gz
gunzip tests.gz
mv tests articles.sqlite
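
Before loading, it can help to check what the database contains. The following is a minimal sketch using only the standard library. `table_counts` is a hypothetical helper, and the demo runs against an in-memory database with made-up rows so that it does not depend on the download. The real file is assumed to contain the `articles` and `sections` tables queried in the next step.

```python
import sqlite3

def table_counts(connection):
  """Return {table name: row count} for every table in a SQLite database."""
  cursor = connection.cursor()
  cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
  names = [row[0] for row in cursor.fetchall()]

  counts = {}
  for name in names:
    # Names are read from sqlite_master and quoted as identifiers
    cursor.execute('SELECT COUNT(*) FROM "{}"'.format(name))
    counts[name] = cursor.fetchone()[0]
  return counts

# Demo on an in-memory database with made-up rows
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE articles (id TEXT, title TEXT)")
demo.execute("CREATE TABLE sections (id INTEGER, article TEXT, text TEXT)")
demo.execute("INSERT INTO articles VALUES ('a1', 'Example title')")
demo.executemany("INSERT INTO sections VALUES (?, ?, ?)", [(0, "a1", "First."), (1, "a1", "Second.")])

counts = table_counts(demo)
print(counts)
```

For the downloaded file, `table_counts(sqlite3.connect("articles.sqlite"))` would report the real tables.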

Load data into Elasticsearch

The following block copies rows from SQLite to Elasticsearch.

import sqlite3

import regex as re

from elasticsearch import Elasticsearch, helpers

# Connect to ES instance
es = Elasticsearch(hosts=["http://localhost:9200"], timeout=60, retry_on_timeout=True)

# Connection to database file
db = sqlite3.connect("articles.sqlite")
cur = db.cursor()

# Elasticsearch bulk buffer
buffer = []
rows = 0

# Select tagged sentences without an NLP label. NLP labels are set for non-informative sentences.
cur.execute("SELECT s.Id, Article, Title, Published, Reference, Name, Text FROM sections s JOIN articles a on s.article=a.id WHERE (s.labels is null or s.labels NOT IN ('FRAGMENT', 'QUESTION')) AND s.tags is not null")
for row in cur:
  # Build dict of name-value pairs for fields
  article = dict(zip(("id", "article", "title", "published", "reference", "name", "text"), row))
  name = article["name"]

  # Skip background, introduction and reference sections, and discussion sections whose name has no "results" before "discussion"
  if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
    # Bulk action fields
    article["_id"] = article["id"]
    article["_index"] = "articles"

    # Buffer article
    buffer.append(article)

    # Increment number of articles processed
    rows += 1

    # Bulk load every 1000 records
    if rows % 1000 == 0:
      helpers.bulk(es, buffer)
      buffer = []

      print("Inserted {} articles".format(rows), end="\r")

if buffer:
  helpers.bulk(es, buffer)

print("Total articles inserted: {}".format(rows))
Total articles inserted: 21499
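
The section filter relies on a variable-length lookbehind, which is why the code imports the third-party `regex` module rather than the standard `re`. The pattern drops sections named background, introduction or reference, and drops discussion sections unless "results" appears earlier in the name (so "Results and Discussion" is kept). The following is a rough sketch of the same rule using plain string operations, wrapped in a generator that yields bulk actions. `keep_section` and `actions` are hypothetical names, and the sample rows are made up.

```python
def keep_section(name):
  """Approximate the section filter above with plain string checks."""
  if not name:
    return True

  name = name.lower()
  if any(word in name for word in ("background", "introduction", "reference")):
    return False

  # Skip "discussion" unless "results" appears somewhere before it
  index = name.find("discussion")
  if index >= 0 and "results" not in name[:index]:
    return False

  return True

def actions(rows, index="articles"):
  """Yield one Elasticsearch bulk action per row that passes the filter."""
  fields = ("id", "article", "title", "published", "reference", "name", "text")
  for row in rows:
    article = dict(zip(fields, row))
    if keep_section(article["name"]):
      article["_id"] = article["id"]
      article["_index"] = index
      yield article

# Made-up rows in the same column order as the SELECT statement above
sample = [
  (0, "a1", "Title", "2020-01-01", "ref", "Results and Discussion", "Kept."),
  (1, "a1", "Title", "2020-01-01", "ref", "Discussion", "Skipped."),
  (2, "a1", "Title", "2020-01-01", "ref", None, "Kept, no section name."),
  (3, "a1", "Title", "2020-01-01", "ref", "Introduction", "Skipped."),
]

kept = [action["text"] for action in actions(sample)]
print(kept)
```

Because `helpers.bulk` accepts any iterable of actions and sends them in chunks, a generator like this could likely be passed as `helpers.bulk(es, actions(cur))` in place of the manual buffer.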

Query data

The following runs a query against Elasticsearch for the terms "risk factors". It finds the top 5 matches and returns the document associated with each match.

import pandas as pd

from IPython.display import display, HTML

pd.set_option("display.max_colwidth", None)

query = {
    "_source": ["article", "title", "published", "reference", "text"],
    "size": 5,
    "query": {
        "query_string": {"query": "risk factors"}
    }
}

results = []
for result in es.search(index="articles", body=query)["hits"]["hits"]:
  source = result["_source"]
  results.append((source["title"], source["published"], source["reference"], source["text"]))

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match"])

display(HTML(df.to_html(index=False)))
| Title | Published | Reference | Match |
|---|---|---|---|
| Management of osteoarthritis during COVID‐19 pandemic | 2020-05-21 00:00:00 | doi.org/10.1002/cpt.1910 | Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) . |
| Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection | 2020-04-24 00:00:00 | medrxiv.org/cgi/content/short/2020.04.20.20.. | This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors. |
| Does apolipoprotein E genotype predict COVID-19 severity? | 2020-04-27 00:00:00 | doi.org/10.1093/qjmed/hcaa142 | Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors . |
| COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants | 2020-07-23 00:00:00 | ncbi.nlm.nih.gov/pubmed/32705587 | BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease. |
| COVID-19: what has been learned and to be learned about the novel coronavirus disease | 2020-03-15 00:00:00 | doi.org/10.7150/ijbs.45134 | • Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia. |
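
The query body and the flattening of hits into table rows can be factored into small functions. The following is a sketch under the assumption that the search response has the `hits.hits[]._source` shape used above. `build_query` and `to_rows` are hypothetical helpers, and the response is a hand-written stand-in rather than real Elasticsearch output.

```python
def build_query(terms, size=5, fields=("article", "title", "published", "reference", "text")):
  """Build a query_string search body returning only the listed source fields."""
  return {
    "_source": list(fields),
    "size": size,
    "query": {
      "query_string": {"query": terms}
    }
  }

def to_rows(response, fields=("title", "published", "reference", "text")):
  """Flatten a search response into tuples, one per hit."""
  return [tuple(hit["_source"].get(field) for field in fields) for hit in response["hits"]["hits"]]

# Hand-written stand-in for a search response
fake = {"hits": {"hits": [
  {"_source": {"article": "a1", "title": "T", "published": "2020-01-01", "reference": "r", "text": "Sentence."}}
]}}

body = build_query("risk factors", size=3)
rows = to_rows(fake)
print(body["size"], rows)
```

With a live server, `to_rows(es.search(index="articles", body=build_query("risk factors")))` should produce the same rows as the loop above. Note that `query_string` combines the two words with OR by default; wrapping the phrase in double quotes inside the query string would require the words to appear together.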

Derive columns with Extractive QA

The next section uses Extractive QA to derive additional columns. For each article, the full text is retrieved and a series of questions is asked of the document. Each answer is added as a derived column for that article.

from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

# Create extractor instance using qa model designed for the CORD-19 dataset
extractor = Extractor(embeddings, "NeuML/bert-small-cord19qa")

# Query template for retrieving the sections of a single article, ordered by id
document = {
    "_source": ["id", "name", "text"],
    "size": 1000,
    "query": {
        "term": {"article": None}
    },
    "sort" : ["id"]
}

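def sections(article):
  rows = []

  # Build a fresh query for this article. document.copy() is shallow, so assigning into
  # its nested "term" dict would also modify the shared template.
  search = dict(document, query={"term": {"article": article}})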

  for result in es.search(index="articles", body=search)["hits"]["hits"]:
    source = result["_source"]
    name, text = source["name"], source["text"]

    if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
      rows.append(text)

  return rows

results = []
for result in es.search(index="articles", body=query)["hits"]["hits"]:
  source = result["_source"]

  # Use QA extractor to derive additional columns
  answers = extractor([("Risk factors", "risk factor", "What are names of risk factors?", False),
                       ("Locations", "city country state", "What are names of locations?", False)], sections(source["article"]))

  results.append((source["title"], source["published"], source["reference"], source["text"]) + tuple([answer[1] for answer in answers]))

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match", "Risk Factors", "Locations"])

display(HTML(df.to_html(index=False)))
| Title | Published | Reference | Match | Risk Factors | Locations |
|---|---|---|---|---|---|
| Management of osteoarthritis during COVID‐19 pandemic | 2020-05-21 00:00:00 | doi.org/10.1002/cpt.1910 | Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) . | Comorbidities | extrapulmonary sites |
| Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection | 2020-04-24 00:00:00 | medrxiv.org/cgi/content/short/2020.04.20.20.. | This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors. | CVD, risk factors but no CVD, and neither CVD | None |
| Does apolipoprotein E genotype predict COVID-19 severity? | 2020-04-27 00:00:00 | doi.org/10.1093/qjmed/hcaa142 | Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors . | socioeconomic inequalities and risk factors | None |
| COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants | 2020-07-23 00:00:00 | ncbi.nlm.nih.gov/pubmed/32705587 | BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease. | Frailty and multimorbidity | comorbidity groupings |
| COVID-19: what has been learned and to be learned about the novel coronavirus disease | 2020-03-15 00:00:00 | doi.org/10.7150/ijbs.45134 | • Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia. | age and underlying disease are strongly correlated | cities, provinces, and countries |
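
Each question passed to the extractor is a tuple. Judging from the call above, its fields are a column name, a search query used to find relevant sections, the question itself, and a flag (assumed here to control whether a surrounding snippet is returned instead of the bare answer). The extractor returns (name, answer) pairs, which is why the code indexes `answer[1]`. The following sketch maps those pairs back to named columns, using a stand-in for the extractor output so it runs without a model. `derive_columns` is a hypothetical helper.

```python
def derive_columns(questions, answers):
  """Map extractor answers to {column name: answer}, keeping question order."""
  found = dict(answers)
  return {name: found.get(name) for name, _, _, _ in questions}

questions = [
  ("Risk factors", "risk factor", "What are names of risk factors?", False),
  ("Locations", "city country state", "What are names of locations?", False),
]

# Stand-in for extractor(questions, texts); values copied from one row of the results above
answers = [("Risk factors", "Frailty and multimorbidity"), ("Locations", "comorbidity groupings")]

columns = derive_columns(questions, answers)
print(columns)
```

Looking answers up by name rather than by position keeps the columns aligned even if the pairs were to arrive in a different order, and a question without an answer maps to `None`.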