Build an Embeddings index with Hugging Face Datasets

Index and search Hugging Face Datasets

This article shows how txtai can index and search with Hugging Face's Datasets library. Datasets opens access to a large and growing list of publicly available datasets. Datasets has functionality to select, transform and filter data stored in each dataset.

In this example, txtai will be used to index and query a dataset.

Install dependencies

Install txtai and all dependencies. Also install datasets.

pip install txtai
pip install datasets

Load dataset and build a txtai index

In this example, we'll load the ag_news dataset, which is a collection of news article headlines. This only takes a single line of code!

Next, txtai will index the first 10,000 rows of the dataset. A sentence similarity model is used to compute sentence embeddings. sentence-transformers has a number of pre-trained models that can be swapped in.

In addition to the embeddings index, we'll also create a Similarity instance to re-rank search hits for relevancy.

from datasets import load_dataset

from txtai.embeddings import Embeddings
from txtai.pipeline import Similarity

def stream(dataset, field, limit):
  index = 0
  for row in dataset:
    yield (index, row[field], None)
    index += 1

    if index >= limit:
      break

def search(query):
  return [(result["score"], result["text"]) for result in embeddings.search(query, limit=50)]

def ranksearch(query):
  results = [text for _, text in search(query)]
  return [(score, results[x]) for x, score in similarity(query, results)]

# Load HF dataset
dataset = load_dataset("ag_news", split="train")

# Create embeddings model, backed by sentence-transformers & transformers, enable content storage
embeddings = Embeddings({"path": "sentence-transformers/paraphrase-MiniLM-L3-v2", "content": True})
embeddings.index(stream(dataset, "text", 10000))

# Create similarity instance for re-ranking
similarity = Similarity("valhalla/distilbart-mnli-12-3")

Search the dataset

Now that an index is ready, let's search the data! The following section runs a series of queries and show the results. Like basic search engines, txtai finds token matches. But the real power of txtai is finding semantically similar results.

sentence-transformers has a great overview on information retrieval that is well worth a read.

from IPython.core.display import display, HTML

def table(query, rows):
    html = """
    <style type='text/css'>
    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');
    table {
      border-collapse: collapse;
      width: 900px;
    }
    th, td {
        border: 1px solid #9e9e9e;
        padding: 10px;
        font: 15px Oswald;
    }
    </style>
    """

    html += "<h3>%s</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead>" % (query)
    for score, text in rows:
        html += "<tr><td>%.4f</td><td>%s</td></tr>" % (score, text)
    html += "</table>"

    display(HTML(html))

for query in ["Positive Apple reports", "Negative Apple reports", "Best planets to explore for life", "LA Dodgers good news", "LA Dodgers bad news"]:
  table(query, ranksearch(query)[:2])

Positive Apple reports

ScoreText
0.9886Apple tops US consumer satisfaction Recent data published by the American Customer Satisfaction Index (ACSI) shows Apple leading the consumer computer industry with the the highest customer satisfaction.
0.9876Apple Remote Desktop 2 Reviewing Apple Remote Desktop 2 in Computerworld, Yuval Kossovsky writes, #147;I liked what I found. #148; He concludes, #147;I am happy to say that ARD 2 is an excellent upgrade and well worth the money. #148; Aug 19

Negative Apple reports

ScoreText
0.9847Apple Recalls 28,000 Faulty Batteries Sold with 15-inch PowerBook Apple has had to recall up to 28,000 notebook batteries that were sold for use with their 15-inch PowerBook. Apple reports that faulty batteries sold between January 2004 and August 2004 can overheat and pose a fire hazard.
0.9733Apple warns about bad batteries Apple is recalling 28,000 faulty batteries for its 15-inch Powerbook G4 laptops.

Best planets to explore for life

ScoreText
0.9110Tiny 'David' Telescope Finds 'Goliath' Planet A newfound planet detected by a small, 4-inch-diameter telescope demonstrates that we are at the cusp of a new age of planet discovery. Soon, new worlds may be located at an accelerating pace, bringing the detection of the first Earth-sized world one step closer.
0.8705Life on Mars Likely, Scientist Claims (SPACE.com) SPACE.com - DENVER, COLORADO -- Those twin robots hard at work on Mars have transmitted teasing views that reinforce the prospect that microbial life may exist on the red planet.

LA Dodgers good news

ScoreText
0.9990Green's Slam Lifts L.A. Shawn Green connects on a grand slam and a solo homer to lead the Los Angeles Dodgers past the Atlanta Braves 7-4 on Saturday.
0.9961Dodgers 7, Braves 4 Los Angeles, Ca. -- Shawn Green belted a grand slam and a solo homer as Los Angeles beat Mike Hampton and the Atlanta Braves 7-to-4 Saturday afternoon.

LA Dodgers bad news

ScoreText
0.9880Expos Keep Dodgers at Bay With 8-7 Win (AP) AP - Giovanni Carrara walked Juan Rivera with the bases loaded and two outs in the ninth inning Monday night, spoiling Los Angeles' six-run comeback and handing the Montreal Expos an 8-7 victory over the Dodgers.
0.9671Gagne blows his 2d save Pinch-hitter Lenny Harris delivered a three-run double off Eric Gagne with two outs in the ninth, rallying the Florida Marlins past the Dodgers, 6-4, last night in Los Angeles.