Benefits of hybrid search

Improve accuracy with a combination of semantic and keyword search

Semantic search is a new category of search built on recent advances in Natural Language Processing (NLP). Traditional search systems use keywords to find data. Semantic search has an understanding of natural language and identifies results that have the same meaning, not necessarily the same keywords.

While semantic search adds amazing capabilities, sparse keyword indexes can still add value. There may be cases where finding an exact match is important, or where we simply want a fast index to run a quick initial scan of a dataset.

Both methods have their merits. What if we combine them to build a unified hybrid search capability? Can we get the best of both worlds?

This article will explore the benefits of hybrid search.

Install dependencies

Install txtai and all dependencies.

# Install txtai
pip install txtai pytrec_eval rank-bm25 elasticsearch

Introducing semantic, keyword and hybrid search

Before diving into the benchmarks, let's briefly discuss how semantic and keyword search work.

Semantic search uses large language models to vectorize inputs into arrays of numbers. Similar concepts will have similar values. The vectors are typically stored in a vector database, which is a system that specializes in storing these numerical arrays and finding matches. Vector search transforms an input query into a vector and then runs a search to find the best conceptual results.
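
To make this concrete, here's a minimal sketch using txtai. The model path and example data below are just illustrative choices, not part of the benchmark setup.

from txtai.embeddings import Embeddings

# Example documents
data = [
  "US tops 5 million confirmed virus cases",
  "Canada's last fully intact ice shelf has suddenly collapsed",
  "Beijing mobilises invasion craft along coast as Taiwan tensions escalate"
]

# Build a vector index; the model path here is just an example
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2"})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

# Matches on meaning - no keyword overlap with the query is required
uid, score = embeddings.search("climate change", 1)[0]
print(data[uid], score)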

Keyword search tokenizes text into lists of tokens per document. These tokens are aggregated into per-document token frequencies and stored in sparse term frequency arrays. At search time, the query is tokenized and its tokens are compared to the tokens in the dataset. This is a more literal process: keyword search is like string matching, with no conceptual understanding; it matches on characters and bytes.
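
For contrast, here's a similar sketch with the rank-bm25 package installed earlier. Tokenization is a naive lowercase split for illustration.

from rank_bm25 import BM25Okapi

# Example documents
data = [
  "US tops 5 million confirmed virus cases",
  "Canada's last fully intact ice shelf has suddenly collapsed",
  "Beijing mobilises invasion craft along coast as Taiwan tensions escalate"
]

# Tokenize each document into a list of tokens
tokenized = [text.lower().split() for text in data]
bm25 = BM25Okapi(tokenized)

# Score the query tokens against each document - only exact token overlap counts
scores = bm25.get_scores("virus cases".split())
for text, score in sorted(zip(data, scores), key=lambda x: -x[1]):
  print(f"{score:.4f} {text}")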

Hybrid search combines the scores from semantic and keyword indexes. Given that semantic search scores are typically 0 - 1 and keyword search scores are unbounded, a method is needed to combine the results.

The two methods supported in txtai are:

- Convex combination when sparse and dense index scores are normalized
- Reciprocal rank fusion (RRF) when they aren't

The default method in txtai is convex combination and that's what we'll use here.
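
To make the combination step concrete, here's a rough sketch of both methods in plain Python. This is a simplified illustration, not the txtai implementation: convex combination min-max normalizes the unbounded keyword scores before blending, while reciprocal rank fusion only looks at result positions.

# Convex combination: normalize sparse scores, then blend with a weight
def convex(dense, sparse, weight=0.5):
  # Min-max normalize the unbounded sparse scores into the 0 - 1 range
  lo, hi = min(sparse.values()), max(sparse.values())
  sparse = {uid: (score - lo) / (hi - lo) if hi != lo else 1.0 for uid, score in sparse.items()}

  # Weighted blend of the two scores per document id
  uids = dense.keys() | sparse.keys()
  return {uid: weight * dense.get(uid, 0) + (1 - weight) * sparse.get(uid, 0) for uid in uids}

# Reciprocal rank fusion: each result list contributes 1 / (k + rank) per document
def rrf(rankings, k=60):
  scores = {}
  for ranking in rankings:
    for rank, uid in enumerate(ranking, 1):
      scores[uid] = scores.get(uid, 0) + 1 / (k + rank)
  return scores

# Example: vector scores are 0 - 1, bm25 scores are unbounded
print(convex({"a": 0.92, "b": 0.45}, {"a": 12.5, "b": 3.1}))
print(rrf([["a", "b"], ["b", "a"]]))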

Evaluating performance

Now it's time to benchmark the results. For these tests, we'll use the BEIR dataset along with a benchmarks script from the txtai project, which has methods to work with BEIR.

We'll select a subset of the BEIR sources for brevity. For each source, we'll benchmark a bm25 index, an embeddings index and a hybrid or combined index.

import os

# Get benchmarks script
os.system("wget https://raw.githubusercontent.com/neuml/txtai/master/examples/benchmarks.py")

# Create output directory
os.makedirs("beir", exist_ok=True)

# Download subset of BEIR datasets
datasets = ["nfcorpus", "fiqa", "arguana", "scidocs", "scifact"]
for dataset in datasets:
  url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
  os.system(f"wget {url}")
  os.system(f"mv {dataset}.zip beir")
  os.system(f"unzip -d beir beir/{dataset}.zip")


Now let's run the benchmarks.

# Remove existing benchmark data
if os.path.exists("benchmarks.json"):
  os.remove("benchmarks.json")

# Runs benchmark evaluation
def evaluate(method):
  for dataset in datasets:
    command = f"python benchmarks.py beir {dataset} {method}"
    print(command)
    os.system(command)

# Calculate benchmarks
for method in ["bm25", "embed", "hybrid"]:
  evaluate(method)

import pandas as pd

def benchmarks():
  # Read JSON lines benchmark data, sorted by NDCG@10 per source
  df = pd.read_json("benchmarks.json", lines=True).sort_values(by=["source", "ndcg_cut_10"], ascending=[True, False])
  return df[["source", "method", "ndcg_cut_10", "map_cut_10", "recall_10", "P_10", "index", "search", "memory"]].reset_index(drop=True)

# Load benchmarks dataframe
df = benchmarks()
df[df.source == "nfcorpus"].reset_index(drop=True)

source    method  ndcg_cut_10  map_cut_10  recall_10  P_10     index  search  memory
nfcorpus  hybrid  0.34531      0.13369     0.17437    0.25480  29.46  3.57    2900
nfcorpus  embed   0.30917      0.10810     0.15327    0.23591  33.64  3.33    2876
nfcorpus  bm25    0.30639      0.11728     0.14891    0.21734   2.72  0.96     652

df[df.source == "fiqa"].reset_index(drop=True)

source  method  ndcg_cut_10  map_cut_10  recall_10  P_10     index   search  memory
fiqa    hybrid  0.36642      0.28846     0.43799    0.10340  233.90  68.42   3073
fiqa    embed   0.36071      0.28450     0.43188    0.10216  212.30  58.83   2924
fiqa    bm25    0.23559      0.17619     0.29855    0.06559   19.78  12.84     76

df[df.source == "arguana"].reset_index(drop=True)

source   method  ndcg_cut_10  map_cut_10  recall_10  P_10     index  search  memory
arguana  hybrid  0.48467      0.40101     0.75320    0.07532  37.80  21.22   2924
arguana  embed   0.47781      0.38781     0.76671    0.07667  34.11  10.21   2910
arguana  bm25    0.45713      0.37118     0.73471    0.07347   3.39  10.95    663

df[df.source == "scidocs"].reset_index(drop=True)

source   method  ndcg_cut_10  map_cut_10  recall_10  P_10     index  search  memory
scidocs  embed   0.21718      0.12982     0.23217    0.11461  27.63  4.41    2929
scidocs  hybrid  0.21104      0.12450     0.22938    0.11341  38.00  6.43    2999
scidocs  bm25    0.15063      0.08756     0.15637    0.07721   3.07  1.42     722

df[df.source == "scifact"].reset_index(drop=True)

source   method  ndcg_cut_10  map_cut_10  recall_10  P_10     index  search  memory
scifact  hybrid  0.71305      0.66773     0.83722    0.09367  39.51  2.35    2918
scifact  bm25    0.66324      0.61764     0.78761    0.08700   4.40  0.93     658
scifact  embed   0.65149      0.60193     0.78972    0.08867  35.15  1.48    2889

The sections above show the metrics per source and method.

Each table lists the source (dataset), index method, the NDCG@10, MAP@10, Recall@10 and P@10 accuracy metrics, index time (s), search time (s) and memory usage (MB). The tables are sorted by NDCG@10 descending.
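
These accuracy metrics come from pytrec_eval, which was installed earlier. As a toy illustration (the query and document ids below are made up), here's how a run is scored against relevance judgments:

import pytrec_eval

# Relevance judgments: query id -> {document id: relevance}
qrels = {"q1": {"d1": 1, "d2": 0, "d3": 1}}

# Retrieval run to score: query id -> {document id: retrieval score}
run = {"q1": {"d1": 1.2, "d2": 0.8, "d3": 0.5}}

# Evaluate the same measures reported above
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10", "map_cut.10", "recall.10", "P.10"})
print(evaluator.evaluate(run))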

Looking at the results, we can see that hybrid search often performs better than embeddings or bm25 individually. In some cases, as with scidocs, the combination performs worse. But in the aggregate, the scores are better, and this holds true for the entire BEIR dataset. For some sources, bm25 does best; for others, embeddings; but overall, the combined hybrid scores are the best.

Hybrid search isn't free though; it's slower since it has extra logic to combine results from the two indexes. For individual queries, the difference is often negligible.

Wrapping up

This article covered ways to improve search accuracy using a hybrid approach. We evaluated performance over a subset of the BEIR dataset to show how hybrid search, in many situations, can improve overall accuracy.

Custom datasets can also be evaluated using this method. This article and the associated benchmarks script can be reused to evaluate which method works best on your data.