Build an Embeddings index with Hugging Face Datasets
Index and search Hugging Face Datasets
This article shows how txtai can index and search with Hugging Face's Datasets library. Datasets opens access to a large and growing list of publicly available datasets. Datasets has functionality to select, transform and filter data stored in each dataset.
In this example, txtai will be used to index and query a dataset.
Install dependencies
Install txtai and all dependencies. Also install datasets.
pip install txtai
pip install datasets
Load dataset and build a txtai index
In this example, we'll load the ag_news dataset, which is a collection of news article headlines and short descriptions. This only takes a single line of code!
Next, txtai will index the first 10,000 rows of the dataset. A sentence similarity model is used to compute sentence embeddings. sentence-transformers has a number of pre-trained models that can be swapped in.
In addition to the embeddings index, we'll also create a Similarity instance to re-rank search hits for relevancy.
from datasets import load_dataset
from txtai.embeddings import Embeddings
from txtai.pipeline import Similarity

def stream(dataset, field, limit):
    index = 0
    for row in dataset:
        yield (index, row[field], None)
        index += 1
        if index >= limit:
            break

def search(query):
    return [(result["score"], result["text"]) for result in embeddings.search(query, limit=50)]

def ranksearch(query):
    results = [text for _, text in search(query)]
    return [(score, results[x]) for x, score in similarity(query, results)]

# Load HF dataset
dataset = load_dataset("ag_news", split="train")

# Create embeddings model, backed by sentence-transformers & transformers, enable content storage
embeddings = Embeddings({"path": "sentence-transformers/paraphrase-MiniLM-L3-v2", "content": True})
embeddings.index(stream(dataset, "text", 10000))

# Create similarity instance for re-ranking
similarity = Similarity("valhalla/distilbart-mnli-12-3")
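The ranksearch function above follows a common two-stage pattern: a fast embeddings search retrieves a pool of candidates, then a slower model rescores only those candidates. The following is a simplified sketch of that flow in plain Python. The retrieve and rescore functions are hypothetical stand-ins (word overlap and a hand-written keyword list), not the models txtai uses, and the documents are made up.

```python
def retrieve(query, documents, limit):
    """Stage 1: cheap candidate retrieval, scored by shared lowercase words."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:limit] if score > 0]

def rescore(query, candidates):
    """Stage 2: hypothetical re-ranker returning (index, score) pairs, best first."""
    negative = {"recalls", "faulty", "warns"}
    wants_negative = "negative" in query.lower()
    scores = []
    for index, text in enumerate(candidates):
        hits = len(negative & set(text.lower().split()))
        scores.append((index, hits if wants_negative else -hits))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up documents for illustration
documents = [
    "Apple tops consumer satisfaction index",
    "Apple recalls faulty batteries",
    "Dodgers beat Braves",
]

query = "Negative Apple reports"
candidates = retrieve(query, documents, limit=50)

# Map each (index, score) pair back to its candidate text, as ranksearch does
ranked = [(score, candidates[index]) for index, score in rescore(query, candidates)]
print(ranked)
```

The second stage only sees the candidates returned by the first, which keeps the expensive model off the full index.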
Search the dataset
Now that an index is ready, let's search the data! This section runs a series of queries and shows the results. Like basic search engines, txtai finds token matches. But the real power of txtai is finding semantically similar results.
sentence-transformers has a great overview on information retrieval that is well worth a read.
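To make the idea of semantic similarity more concrete, here is a rough sketch of how an embeddings search can rank results: each text maps to a vector, and candidates are ordered by cosine similarity to the query vector. The three-dimensional vectors below are invented for illustration; a real model produces vectors with hundreds of dimensions, and txtai's internal scoring may differ in detail.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors (assumed values, not output from a real model)
vectors = {
    "Apple recalls faulty batteries": [0.9, 0.1, 0.0],
    "Dodgers win on grand slam": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.3, 0.1]

# Order texts by similarity to the query, most similar first
ranked = sorted(vectors, key=lambda text: cosine(query_vector, vectors[text]), reverse=True)
print(ranked[0])
```

Because the comparison happens between vectors rather than words, a query can match a text that shares no tokens with it.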
from IPython.display import display, HTML

def table(query, rows):
    # Build an HTML table with one row per (score, text) result
    html = """
    <style type='text/css'>
    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');
    table {
      border-collapse: collapse;
      width: 900px;
    }
    th, td {
      border: 1px solid #9e9e9e;
      padding: 10px;
      font: 15px Oswald;
    }
    </style>
    """
    html += "<h3>%s</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead>" % (query)
    for score, text in rows:
        html += "<tr><td>%.4f</td><td>%s</td></tr>" % (score, text)
    html += "</table>"
    display(HTML(html))

# Show the top two re-ranked results for each query
for query in ["Positive Apple reports", "Negative Apple reports", "Best planets to explore for life", "LA Dodgers good news", "LA Dodgers bad news"]:
    table(query, ranksearch(query)[:2])
Positive Apple reports
Score | Text |
0.9886 | Apple tops US consumer satisfaction Recent data published by the American Customer Satisfaction Index (ACSI) shows Apple leading the consumer computer industry with the the highest customer satisfaction. |
0.9876 | Apple Remote Desktop 2 Reviewing Apple Remote Desktop 2 in Computerworld, Yuval Kossovsky writes, #147;I liked what I found. #148; He concludes, #147;I am happy to say that ARD 2 is an excellent upgrade and well worth the money. #148; Aug 19 |
Negative Apple reports
Score | Text |
0.9847 | Apple Recalls 28,000 Faulty Batteries Sold with 15-inch PowerBook Apple has had to recall up to 28,000 notebook batteries that were sold for use with their 15-inch PowerBook. Apple reports that faulty batteries sold between January 2004 and August 2004 can overheat and pose a fire hazard. |
0.9733 | Apple warns about bad batteries Apple is recalling 28,000 faulty batteries for its 15-inch Powerbook G4 laptops. |
Best planets to explore for life
Score | Text |
0.9110 | Tiny 'David' Telescope Finds 'Goliath' Planet A newfound planet detected by a small, 4-inch-diameter telescope demonstrates that we are at the cusp of a new age of planet discovery. Soon, new worlds may be located at an accelerating pace, bringing the detection of the first Earth-sized world one step closer. |
0.8705 | Life on Mars Likely, Scientist Claims (SPACE.com) SPACE.com - DENVER, COLORADO -- Those twin robots hard at work on Mars have transmitted teasing views that reinforce the prospect that microbial life may exist on the red planet. |
LA Dodgers good news
Score | Text |
0.9990 | Green's Slam Lifts L.A. Shawn Green connects on a grand slam and a solo homer to lead the Los Angeles Dodgers past the Atlanta Braves 7-4 on Saturday. |
0.9961 | Dodgers 7, Braves 4 Los Angeles, Ca. -- Shawn Green belted a grand slam and a solo homer as Los Angeles beat Mike Hampton and the Atlanta Braves 7-to-4 Saturday afternoon. |
LA Dodgers bad news
Score | Text |
0.9880 | Expos Keep Dodgers at Bay With 8-7 Win (AP) AP - Giovanni Carrara walked Juan Rivera with the bases loaded and two outs in the ninth inning Monday night, spoiling Los Angeles' six-run comeback and handing the Montreal Expos an 8-7 victory over the Dodgers. |
0.9671 | Gagne blows his 2d save Pinch-hitter Lenny Harris delivered a three-run double off Eric Gagne with two outs in the ninth, rallying the Florida Marlins past the Dodgers, 6-4, last night in Los Angeles. |